
Methods: GitHub


A broad range of information is extracted using the GitHub API. Each crawling script expects a CSV file with a column containing the GitHub repository IDs; the name of that column is configurable.

Requests to the GitHub API are subject to rate limiting, and the rate limits reset every hour. Every function in the crawling scripts is therefore wrapped to catch rate-limit exceptions and wait until the reset time if the limit is exceeded. As a result, the scripts can run for a very long time.
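
As an illustration, a minimal sketch of this wrapping pattern, assuming PyGithub is used for the API calls (an assumption; `call_with_rate_limit` and the token placeholder are illustrative, not the actual script code):

```python
import time

from github import Github, RateLimitExceededException

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def call_with_rate_limit(func, *args, **kwargs):
    """Call a GitHub API method, sleeping until the rate limit resets if it is exceeded."""
    while True:
        try:
            return func(*args, **kwargs)
        except RateLimitExceededException:
            # rate_limiting_resettime is the Unix timestamp at which the core rate limit resets
            wait_seconds = g.rate_limiting_resettime - time.time() + 5
            time.sleep(max(wait_seconds, 0))

# usage: repo = call_with_rate_limit(g.get_repo, "user/repo")
```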

Metadata

Metadata is extracted directly from the GitHub API. The resulting CSV file will have a row for each repository.

metadata.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| archived | bool | whether the repository is archived | |
| created_at | datetime | repository creation time | |
| has_wiki | bool | whether the repository has a wiki | |
| has_pages | bool | whether the repository has GitHub Pages | |
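
If PyGithub is used (an assumption), these fields map directly onto repository attributes; a sketch with a hypothetical helper `fetch_metadata`:

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_metadata(repo_id: str) -> dict:
    """Collect the metadata.csv fields for a single repository ID (user/repo)."""
    repo = g.get_repo(repo_id)
    return {
        "github_user_cleaned_url": repo_id,
        "archived": repo.archived,
        "created_at": repo.created_at,
        "has_wiki": repo.has_wiki,
        "has_pages": repo.has_pages,
    }
```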

Contents

Contents data is extracted in two ways. First, we look for specific files, for example the license, contributing guidelines and README, and collect simple statistical data about these using the GitHub API; we also examine the commit history of some of these files using pydriller to determine when they were added. Second, we examine the commit history of the README file for any changes to its headings, again using pydriller. Headings are recognised only by a leading #, which does not cover all possible heading formats. Behind the scenes, pydriller temporarily clones or downloads (parts of) the repository, which can lead to long execution times and memory exceptions.
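
A sketch of what the README history traversal could look like with pydriller, assuming the README lives at README.md and headings are detected by a leading # as described above (`readme_heading_changes` is a hypothetical helper, not the actual script code):

```python
from pydriller import Repository

README = "README.md"

def readme_heading_changes(repo_url: str):
    """Yield, per commit touching the README, the headings added and deleted."""
    # filepath= restricts traversal to commits that modified the README
    for commit in Repository(repo_url, filepath=README).traverse_commits():
        for mod in commit.modified_files:
            if mod.new_path != README and mod.old_path != README:
                continue
            # diff_parsed is a dict with "added" and "deleted" lists of (line_no, line) pairs
            added = [line.lstrip("#").strip()
                     for _, line in mod.diff_parsed["added"] if line.startswith("#")]
            deleted = [line.lstrip("#").strip()
                       for _, line in mod.diff_parsed["deleted"] if line.startswith("#")]
            if added or deleted:
                yield commit.author_date, added, deleted
```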

There are two resulting CSV files. contents.csv contains one row for each repository. readme_history.csv contains one row for each commit that modified one or more headings in a repository's README or added a citation to it.

contents.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| license | str | license identifier as used by GitHub | can be other if GitHub doesn't recognise the license type, or none if no license file was found |
| readme_size | int | README file size in bytes | |
| readme_path | str | path to README file | usually README.md, but can also be README.rst |
| readme_emojis | int | number of emojis used in the README file in its current state | |
| contributing_size | int | CONTRIBUTING.md file size in bytes | 0 if non-existent |
| citation_added | datetime | time at which a CITATION.cff file was added to the repository | does not consider variations of the filename |
| contributing_added | datetime | time at which a CONTRIBUTING.md file was added to the repository | does not consider variations of the filename |

readme_history.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| readme_path | str | path to README file | from contents.csv |
| author_date | datetime | time of commit | |
| added_headings | list[str] | headings added in the commit | includes the full heading without the leading # |
| deleted_headings | list[str] | headings deleted in the commit | includes the full heading without the leading # |
| added_cites | list[str] | any modified/added/deleted lines that contain citation indicators | indicators are ["DOI:", "doi.", "@article", "@misc"] |
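
The citation detection amounts to a substring check of changed lines against the indicator list above; a minimal sketch (`citation_lines` is a hypothetical helper):

```python
CITATION_INDICATORS = ["DOI:", "doi.", "@article", "@misc"]

def citation_lines(changed_lines: list[str]) -> list[str]:
    """Return the changed README lines that contain any citation indicator."""
    return [line for line in changed_lines
            if any(indicator in line for indicator in CITATION_INDICATORS)]
```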

Contributions

The contribution statistics of each contributor are extracted for each repository. Because of the way the GitHub API reports contributions, this method has some limitations (merge commits are not recorded).

The resulting file is contributions.csv, which contains, for each repository, one row per commit author per week of the repository's life.

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| author | str | GitHub ID of a contributor to this repository | |
| week_co | datetime | Monday of a week in the repository's life | |
| commits | int | number of commits made to the main branch of the repository by author in the week commencing on week_co | |
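
Assuming PyGithub is used (the page does not name the client library), the weekly figures correspond to the contributor statistics endpoint; a sketch with a hypothetical helper `fetch_contributions`:

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_contributions(repo_id: str):
    """Yield one record per contributor and week, matching the contributions.csv columns."""
    repo = g.get_repo(repo_id)
    # get_stats_contributors() returns None while GitHub is still computing the statistics
    for stats in repo.get_stats_contributors() or []:
        for week in stats.weeks:
            yield {
                "github_user_cleaned_url": repo_id,
                "author": stats.author.login,
                "week_co": week.w,   # start of the week as reported by the API
                "commits": week.c,   # commits by this author in that week
            }
```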

Engagement

Forks and stars can be interpreted as markers of repository engagement, though how meaningful they are is debated. The resulting files contain one row for each time a repository was forked (forks.csv) or starred (stars.csv). We did not collect data indicating whether a fork was active or had ever received any commits.
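
Assuming PyGithub, star and fork events with their timestamps can be collected roughly as follows (`fetch_engagement` is a hypothetical helper, not the actual script code):

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_engagement(repo_id: str):
    """Return (stars, forks) event lists matching the stars.csv and forks.csv columns."""
    repo = g.get_repo(repo_id)
    stars = [{"github_user_cleaned_url": repo_id,
              "date": s.starred_at,
              "user": s.user.login}
             for s in repo.get_stargazers_with_dates()]
    forks = [{"github_user_cleaned_url": repo_id,
              "date": f.created_at,
              "user": f.owner.login}
             for f in repo.get_forks()]
    return stars, forks
```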

forks.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| date | datetime | time of fork event | |
| user | str | GitHub username of the user who forked the repository | |

stars.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| date | datetime | time of star event | |
| user | str | GitHub username of the user who starred the repository | |

Issues

We do not distinguish between pull requests and issues. The resulting file has a row for each issue in each repository.
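
Assuming PyGithub, the issue fields map onto the issues endpoint, which returns pull requests alongside issues; a sketch with a hypothetical helper `fetch_issues`:

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_issues(repo_id: str):
    """Yield one record per issue (including pull requests), matching issues.csv."""
    repo = g.get_repo(repo_id)
    for issue in repo.get_issues(state="all"):
        yield {
            "github_user_cleaned_url": repo_id,
            "state": issue.state,
            "created_at": issue.created_at,
            "user": issue.user.login,
            "closed_at": issue.closed_at,
            # closed_by may require an additional API request per issue
            "closed_by": issue.closed_by.login if issue.closed_by else None,
        }
```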

issues.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| state | str | one of open or closed | |
| created_at | datetime | issue creation time | |
| user | str | GitHub username of the issue author | |
| closed_at | datetime | issue closing time | None if it is still open |
| closed_by | str | GitHub username of the user who closed the issue | None if it is still open |