GHTorrent: GitHub’s Data from a Firehose

by Gousios, Georgios and Spinellis, Diomidis

edited by Godfrey, Michael W. and Whitehead, Jim

You can get a pre-print version from here.
You can view the publisher's page here.
See the paper's associated code repository: gousiosg/github-mirror


A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects’ repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub’s event streams and persistent data, and offer it to the research community as a service. In this paper, we present the project’s design and initial implementation and demonstrate how the provided datasets can be queried and processed.
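As a rough illustration of what mirroring an event stream involves, the sketch below polls GitHub's public events endpoint and keeps only the events it has not seen before. It is not the GHTorrent implementation. The function names `fetch_events` and `mirror`, the in-memory dictionary store, and the `User-Agent` string are assumptions made for this example.

```python
import json
from urllib.request import Request, urlopen

# GitHub's public events endpoint (unauthenticated requests are rate limited).
EVENTS_URL = "https://api.github.com/events"


def fetch_events(url=EVENTS_URL):
    """Fetch one page of public events as a list of dicts (network call)."""
    req = Request(url, headers={
        "Accept": "application/vnd.github+json",
        "User-Agent": "event-mirror-sketch",  # hypothetical client name
    })
    with urlopen(req) as resp:
        return json.load(resp)


def mirror(store, events):
    """Add unseen events to `store` (a dict keyed by event id).

    Returns the number of events that were new. Consecutive polls of the
    endpoint overlap, so deduplicating by id keeps the mirror consistent.
    """
    new = 0
    for event in events:
        if event["id"] not in store:
            store[event["id"]] = event
            new += 1
    return new
```

A caller would run `mirror(store, fetch_events())` in a loop with a delay between polls. A real mirror would also need persistent storage and would follow each event to the resources it references, such as commits.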

Bibtex record

  author = {Gousios, Georgios and Spinellis, Diomidis},
  booktitle = { {MSR} '12: Proceedings of the 9th Working Conference on Mining Software Repositories},
  editor = {Godfrey, Michael W. and Whitehead, Jim},
  location = {Zurich, Switzerland},
  month = jun,
  pages = {12--21},
  publisher = {IEEE},
  title = { {GHT}orrent: {G}it{H}ub's Data from a Firehose},
  year = {2012},
  doi = {10.1109/MSR.2012.6224294},
  issn = {2160-1852},
  slideshareembed = {13184524},
  url = {/pub/ghtorrent-githubs-data-from-a-firehose.pdf},
  github = {gousiosg/github-mirror}


The paper