GitHub Innovation Graph Metrics
Last updated: 2023-09-21
Version: 1.0.0
The GitHub Policy Team manages the dataset. Inquiries can be made to [email protected]. Please use a subject line that includes “GitHub Innovation Graph.”
The dataset is openly accessible.
The dataset can be accessed at its dedicated repository: github.com/github/innovationgraph
The dataset comprises 8 CSV files of GitHub metrics, aggregated by economy and reported quarterly, with data going back to January 2020. A metric is reported for an economy only when 100 or more unique developers performed the relevant activity within the time period.
Activity is assigned to a location based on the IP address of the relevant user when interacting with GitHub. If a user changes locations during the period, all of that user's activity is assigned to the mode (the most frequent value) of their locations sampled daily in the period. Concretely, if a developer contributed to open source projects from the United States for two months, but also made contributions while traveling in India, all activity from that developer during that quarter would be assigned to the United States.
Additionally, the last known location of the developer is carried forward on a daily basis even if no activities were performed by the developer that day. For example, if a developer performed activities within the United States and then became inactive for 6 days, that developer would be considered to be in the United States for that 7-day span.
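The location rule described above (daily sampling, carry-forward on inactive days, then taking the mode) can be sketched as follows. This is a minimal illustration and not GitHub's implementation: the function name `assign_period_location`, the one-location-per-day input, and the tie-breaking behavior are all assumptions made for the example.

```python
from collections import Counter
from datetime import date, timedelta


def assign_period_location(observations, period_start, period_end):
    """Return the most frequent daily location in [period_start, period_end].

    observations maps a date to the economy code seen on that day
    (hypothetical input shape). Days without activity inherit the last
    known location, as the text describes.
    """
    daily = []
    last_known = None
    day = period_start
    while day <= period_end:
        # Carry the last known location forward when there is no activity.
        last_known = observations.get(day, last_known)
        if last_known is not None:
            daily.append(last_known)
        day += timedelta(days=1)
    if not daily:
        return None
    # Ties resolve to the location seen first; this is an arbitrary
    # choice for the sketch, not a documented rule.
    return Counter(daily).most_common(1)[0][0]


# A developer based in the United States who travels to India for ten days.
quarter_location = assign_period_location(
    {
        date(2023, 1, 2): "US",
        date(2023, 3, 1): "IN",
        date(2023, 3, 11): "US",
    },
    date(2023, 1, 1),
    date(2023, 3, 31),
)
```

Here the developer's whole quarter is assigned to the United States, because most of the carried-forward daily locations fall there. The sketch ignores locations known from before the period starts, which a real pipeline would presumably carry in.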
We report on the following metrics:
- Git pushes: the number of times developers in a given economy uploaded code to GitHub. See the documentation for `git push` for a description of the command. Changes to files made through GitHub’s online platform automatically result in a push. Note that a single `git push` may contain multiple commits.
- Repositories: the number of software projects in a given economy based on the mode location of all repository members with triage and above access. See our documentation for Repositories for more information.
- Developers: the number of developer accounts located in a given economy based on mode daily location. This count excludes users that are bots or otherwise flagged as “spammy” within internal systems. See our documentation for personal accounts for more information.
- Organizations: the number of developer groups in a given economy, including companies, academic groups, nonprofits, and informal collectives that organize activity on GitHub. Location is assigned based on the mode location of all organization members. See our documentation for Organizations for more information.
- Programming languages: the number of unique developers in each economy who made at least one git push to a repository with a given programming language. See our documentation for repository languages for more information about how we detect programming languages.
- Licenses: the number of unique developers in each economy who made at least one git push to a repository with a given license. See our documentation for Licenses for more information about how we classify repositories by license. Note that `NOASSERTION` in the data, or `Other` as displayed, means a license file was found but could not be identified with high confidence, or multiple licenses were present in a repository.
- Topics: the number of unique developers who made at least one git push to a repository with a given topic. See our documentation for Topics for more information about how developers assign topics to repositories.
- Economy collaborators: the volume of collaboration on software projects based on the sum of git pushes sent and pull requests opened by a developer to a repository owned by another developer or organization. See the documentation for `git push` for a description of the command. See our documentation for Pull Requests and Repositories for more information about supported functionality.
Git pushes:

git_pushes | iso2_code | year | quarter |
---|---|---|---|
integer | string | integer | integer |

Repositories:

repositories | iso2_code | year | quarter |
---|---|---|---|
integer | string | integer | integer |

Developers:

developers | iso2_code | year | quarter |
---|---|---|---|
integer | string | integer | integer |

Organizations:

organizations | iso2_code | year | quarter |
---|---|---|---|
integer | string | integer | integer |

Programming languages:

num_pushers | language | language_type | iso2_code | year | quarter |
---|---|---|---|---|---|
integer | string | string | string | integer | integer |

Licenses:

num_pushers | spdx_license | iso2_code | year | quarter |
---|---|---|---|---|
integer | string | string | integer | integer |

Topics:

num_pushers | topic | iso2_code | year | quarter |
---|---|---|---|---|
integer | string | string | integer | integer |

Economy collaborators:

weight | source | destination | iso2_code | year | quarter |
---|---|---|---|---|---|
integer | string | string | string | integer | integer |
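As an illustration of working with the column layouts listed above, the following sketch parses rows shaped like the git pushes file and builds a quarterly series for one economy. It assumes the files are comma-delimited with a header row that matches the column names; the sample values, the placeholder economy codes `XA` and `XB`, and the helper names `load_metric` and `series_for` are made up for the example.

```python
import csv
import io

# Hypothetical sample in the documented column layout; values are invented.
SAMPLE = """git_pushes,iso2_code,year,quarter
1500,XA,2020,2
1200,XA,2020,1
300,XB,2020,1
"""


def load_metric(handle, value_column):
    """Parse CSV rows, converting the integer columns named in the schema."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {
                "iso2_code": row["iso2_code"],  # kept as a plain string
                "year": int(row["year"]),
                "quarter": int(row["quarter"]),
                value_column: int(row[value_column]),
            }
        )
    return rows


def series_for(rows, iso2_code, value_column):
    """Return [((year, quarter), value), ...] for one economy, in time order."""
    selected = [r for r in rows if r["iso2_code"] == iso2_code]
    selected.sort(key=lambda r: (r["year"], r["quarter"]))
    return [((r["year"], r["quarter"]), r[value_column]) for r in selected]


rows = load_metric(io.StringIO(SAMPLE), "git_pushes")
xa_series = series_for(rows, "XA", "git_pushes")
```

The standard `csv` module keeps `iso2_code` values as plain strings. That matters for a code such as `NA`, which some data-frame libraries interpret as a missing value by default.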
Innovation Graph metrics are intended to support research and public policy that could benefit from data on software development activity globally. We welcome developers to explore the data, discover insights, and create visualizations, among much more.
Innovation Graph metrics are not appropriate for understanding individual projects or individual developers’ activity. The dataset is unlikely to be of use for training specific purpose machine learning systems. It is not intended to replace API queries for specific data needs.
Data is collected in the course of regular developer activity on, and operation of, the GitHub platform. GitHub staff use the data to construct the Innovation Graph metrics.
As described further in the Data Quality section below, we excluded activity from users that were deemed to be automated or inauthentic, and filtered terms that might individually identify users, such as in the case of the Topics metric.
As guided by the GitHub Privacy Statement, we have used platform data to conduct this research. We aggregate individual-level activity and use a reporting threshold of 100 unique developers performing an activity within the time period. If this threshold is not met, we do not report the metric for that economy. The intent is to minimize any privacy risk to individuals. The raw data underlying the public metrics was accessible only to select GitHub employees during the preparation of the dataset, and then only under appropriate controls.
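The aggregation and reporting threshold can be pictured with a short sketch. This is an assumption-laden illustration, not the production query: the event tuple shape, the function name `aggregate_with_threshold`, and the placeholder economy codes are hypothetical; only the threshold of 100 unique developers comes from the text.

```python
from collections import defaultdict

REPORTING_THRESHOLD = 100  # minimum unique developers, as stated in the text


def aggregate_with_threshold(events, threshold=REPORTING_THRESHOLD):
    """Count unique developers per (economy, year, quarter) cell.

    events is an iterable of (developer_id, iso2_code, year, quarter)
    tuples (hypothetical shape). Cells below the threshold are omitted,
    so no value is reported for them at all.
    """
    developers = defaultdict(set)
    for developer_id, iso2_code, year, quarter in events:
        developers[(iso2_code, year, quarter)].add(developer_id)
    return {
        cell: len(ids)
        for cell, ids in developers.items()
        if len(ids) >= threshold
    }


# 150 developers in XA (each active twice) and 99 developers in XB.
events = [(f"dev{i}", "XA", 2020, 1) for i in range(150)]
events += [(f"dev{i}", "XA", 2020, 1) for i in range(150)]
events += [(f"other{i}", "XB", 2020, 1) for i in range(99)]
reported = aggregate_with_threshold(events)
```

In this example only the XA cell is reported, with a count of 150: repeated activity by the same developer does not inflate the count, and the XB cell is dropped because it has 99 unique developers.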
11. How representative is this dataset? What population(s), contexts (e.g., scripted vs. conversational speech), conditions (e.g., lighting for images) is it representative of?
Innovation Graph metrics reflect developer activity on GitHub. Although these offer valuable insights into software development, GitHub data alone paints an incomplete picture; we are but one resource in the ecosystem.
No demographic groups are identified in this dataset.
13. Is there any missing information in the dataset? If yes, please explain what information is missing and why (e.g., some people did not report their gender).
Metrics for economies are only reported when there are 100 or more unique developers performing the relevant activity within the time period. This means that economies with small numbers of GitHub developers will be missing. As such, Innovation Graph metrics are not useful for understanding software development in these smaller economies.
Additionally, we excluded developers whose activity volume exceeded what could reasonably be attributed to human activity, as well as developers who were classified by our automated systems as inauthentic. Thus, Innovation Graph metrics are not useful for understanding the adoption and behavior of bots on GitHub.
We assign the location of repositories and organizations based on the mode location of their members. For the economy collaborators metric, which relies on these locations, this may in effect under-count cross-border collaboration: a repository whose members collaborate from many economies is attributed to only one economy in our calculation. As such, economy collaborators should be viewed as a lower bound rather than a precise measure.
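A small sketch shows why this attribution acts as a lower bound. It is illustrative only: the function names, the input shapes, and the reading of `source` as the contributor's economy and `destination` as the repository's economy are assumptions, not documented behavior.

```python
from collections import Counter


def mode_location(member_locations):
    """Return the most frequent economy among a repository's members."""
    return Counter(member_locations).most_common(1)[0][0]


def collaboration_weights(repo_members, contributions):
    """Sum pushes and pull requests per (source, destination) pair.

    repo_members maps a repository name to its members' economies.
    contributions is an iterable of
    (contributor_economy, repo, pushes, pull_requests) tuples.
    Both shapes are hypothetical.
    """
    repo_economy = {
        repo: mode_location(locations)
        for repo, locations in repo_members.items()
    }
    weights = Counter()
    for source, repo, pushes, pull_requests in contributions:
        weights[(source, repo_economy[repo])] += pushes + pull_requests
    return weights


# One repository with members in XA, XA, and XC; a contributor from XB
# sends 3 pushes and opens 2 pull requests.
weights = collaboration_weights(
    {"repo1": ["XA", "XA", "XC"]},
    [("XB", "repo1", 3, 2)],
)
```

The contribution is recorded as a weight of 5 from XB to XA. The member in XC also collaborated with the XB contributor, but no XB to XC link appears, which is the under-counting described above.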
For metrics related to repositories, we only report on numbers and activity related to those that are public. As a result, Innovation Graph metrics are not useful for understanding the volume and activity of software development in private repositories.
GitHub activity is assigned to an economy based on the IP address of the given developer or organization. Thus, VPNs and other means of altering or hiding IP addresses distort the metrics. Economies where developers may be more likely to use such tools will be affected more than others; thus, international comparisons should note this limitation.
GitHub staff manually verified samples of the results for each data query.
16. How can dataset users receive information if this dataset is updated (e.g., corrections, additions, removals)?
If you use this dataset for your research, please watch the GitHub Innovation Graph repository to subscribe to changes.
The dataset will be updated quarterly, with previous files deleted but their contents carried over to the new files. Of course, the previous versions will be available in the git history of this repository.
18. Describe any applicable intellectual property (IP) licenses, copyright, fees, terms of use, export controls, or other regulatory restrictions that apply to this dataset or individual data points.
The dataset is made available under a CC0-1.0 license.