
Methods: GitHub


A broad range of information is extracted using the GitHub API. Each crawling script expects a CSV file with a column containing the GitHub repository IDs; the name of that column is configurable.

Requests to the GitHub API are subject to rate limiting, and the rate limits reset every hour. Every function in the crawling scripts is therefore wrapped to catch rate-limit exceptions and wait until the reset time if the limit is exceeded. As a result, the scripts can run for a very long time.
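
As an illustration, a minimal sketch of this wrapping pattern, assuming PyGithub is used for the API calls (an assumption; `call_with_rate_limit` and the token placeholder are illustrative, not the actual script code):

```python
import time

from github import Github, RateLimitExceededException

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def call_with_rate_limit(func, *args, **kwargs):
    """Call a GitHub API method, sleeping until the rate limit resets if it is exceeded."""
    while True:
        try:
            return func(*args, **kwargs)
        except RateLimitExceededException:
            # rate_limiting_resettime is the Unix timestamp at which the core rate limit resets
            wait_seconds = g.rate_limiting_resettime - time.time() + 5
            time.sleep(max(wait_seconds, 0))

# usage: repo = call_with_rate_limit(g.get_repo, "user/repo")
```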

Metadata

Metadata is extracted directly from the GitHub API. The resulting CSV file will have a row for each repository.

metadata.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| archived | bool | whether the repository is archived | |
| created_at | datetime | repository creation time | |
| has_wiki | bool | whether the repository has a wiki | |
| has_pages | bool | whether the repository has GitHub Pages | |
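
If PyGithub is used (an assumption), these fields map directly onto repository attributes; a sketch with a hypothetical helper `fetch_metadata`:

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_metadata(repo_id: str) -> dict:
    """Collect the metadata.csv fields for a single repository ID (user/repo)."""
    repo = g.get_repo(repo_id)
    return {
        "github_user_cleaned_url": repo_id,
        "archived": repo.archived,
        "created_at": repo.created_at,
        "has_wiki": repo.has_wiki,
        "has_pages": repo.has_pages,
    }
```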

Contents

Contents data is extracted in two ways. First, we look for specific files, for example the license, contributing guidelines and README, and collect simple statistical data about these using the GitHub API; we also examine the commit history of some of these files using pydriller to determine when they were added. Second, we examine the commit history of the README file for any changes to its headings, again using pydriller. Headings are recognised only by a leading #, which does not cover all possible heading formats. Behind the scenes, pydriller temporarily clones or downloads (parts of) the repository, which can lead to long execution times and memory exceptions.
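
A sketch of what the README history traversal could look like with pydriller, assuming the README lives at README.md and headings are detected by a leading # as described above (`readme_heading_changes` is a hypothetical helper, not the actual script code):

```python
from pydriller import Repository

README = "README.md"

def readme_heading_changes(repo_url: str):
    """Yield, per commit touching the README, the headings added and deleted."""
    # filepath= restricts traversal to commits that modified the README
    for commit in Repository(repo_url, filepath=README).traverse_commits():
        for mod in commit.modified_files:
            if mod.new_path != README and mod.old_path != README:
                continue
            # diff_parsed is a dict with "added" and "deleted" lists of (line_no, line) pairs
            added = [line.lstrip("#").strip()
                     for _, line in mod.diff_parsed["added"] if line.startswith("#")]
            deleted = [line.lstrip("#").strip()
                       for _, line in mod.diff_parsed["deleted"] if line.startswith("#")]
            if added or deleted:
                yield commit.author_date, added, deleted
```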

There are two resulting CSV files. contents.csv contains one row for each repository. readme_history.csv contains one row for each commit that modified one or more headings in a repository's README or added a citation to it.

contents.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| license | str | license identifier as used by GitHub | can be other if GitHub doesn't recognise the license type, or none if no license file was found |
| readme_size | int | README file size in bytes | |
| readme_path | str | path to README file | usually README.md, but can also be README.rst |
| readme_emojis | int | number of emojis used in the README file in its current state | |
| contributing_size | int | CONTRIBUTING.md file size in bytes | 0 if non-existent |
| citation_added | datetime | time at which a CITATION.cff file was added to the repository | does not consider variations of the filename |
| contributing_added | datetime | time at which a CONTRIBUTING.md file was added to the repository | does not consider variations of the filename |

readme_history.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| readme_path | str | path to README file | from contents.csv |
| author_date | datetime | time of commit | |
| added_headings | list[str] | headings added in the commit | includes the full heading without the leading # |
| deleted_headings | list[str] | headings deleted in the commit | includes the full heading without the leading # |
| added_cites | list[str] | any modified/added/deleted lines that contain citation indicators | indicators are ["DOI:", "doi.", "@article", "@misc"] |
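
The citation detection amounts to a substring check of changed lines against the indicator list above; a minimal sketch (`citation_lines` is a hypothetical helper):

```python
CITATION_INDICATORS = ["DOI:", "doi.", "@article", "@misc"]

def citation_lines(changed_lines: list[str]) -> list[str]:
    """Return the changed README lines that contain any citation indicator."""
    return [line for line in changed_lines
            if any(indicator in line for indicator in CITATION_INDICATORS)]
```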

Contributions

The contribution statistics of each contributor are extracted for each repository. Because of the way the GitHub API reports contributions, this method has some limitations (merge commits are not recorded).

The resulting file is contributions.csv, which contains, for each repository, one row per commit author per week of the repository's life.

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| author | str | GitHub ID of a contributor to this repository | |
| week_co | datetime | Monday of a week in the repository's life | |
| commits | int | number of commits made to the main branch of the repository by author in the week commencing on week_co | |
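
Assuming PyGithub is used (the page does not name the client library), the weekly figures correspond to the contributor statistics endpoint; a sketch with a hypothetical helper `fetch_contributions`:

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_contributions(repo_id: str):
    """Yield one record per contributor and week, matching the contributions.csv columns."""
    repo = g.get_repo(repo_id)
    # get_stats_contributors() returns None while GitHub is still computing the statistics
    for stats in repo.get_stats_contributors() or []:
        for week in stats.weeks:
            yield {
                "github_user_cleaned_url": repo_id,
                "author": stats.author.login,
                "week_co": week.w,   # start of the week as reported by the API
                "commits": week.c,   # commits by this author in that week
            }
```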

Engagement

Forks and stars can be interpreted as markers of repository engagement, though how meaningful they are is debated. The resulting files contain one row for each time a repository was forked (forks.csv) or starred (stars.csv). We did not collect data indicating whether a fork was active or had ever received any commits.
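
Assuming PyGithub, star and fork events with their timestamps can be collected roughly as follows (`fetch_engagement` is a hypothetical helper, not the actual script code):

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_engagement(repo_id: str):
    """Return (stars, forks) event lists matching the stars.csv and forks.csv columns."""
    repo = g.get_repo(repo_id)
    stars = [{"github_user_cleaned_url": repo_id,
              "date": s.starred_at,
              "user": s.user.login}
             for s in repo.get_stargazers_with_dates()]
    forks = [{"github_user_cleaned_url": repo_id,
              "date": f.created_at,
              "user": f.owner.login}
             for f in repo.get_forks()]
    return stars, forks
```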

forks.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| date | datetime | time of fork event | |
| user | str | GitHub username of the user who forked the repository | |

stars.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| date | datetime | time of star event | |
| user | str | GitHub username of the user who starred the repository | |

Issues

We do not distinguish between pull requests and issues. The resulting file has a row for each issue in each repository.
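
Assuming PyGithub, the issue fields map onto the issues endpoint, which returns pull requests alongside issues; a sketch with a hypothetical helper `fetch_issues`:

```python
from github import Github

g = Github("YOUR_GITHUB_TOKEN")  # placeholder token

def fetch_issues(repo_id: str):
    """Yield one record per issue (including pull requests), matching issues.csv."""
    repo = g.get_repo(repo_id)
    for issue in repo.get_issues(state="all"):
        yield {
            "github_user_cleaned_url": repo_id,
            "state": issue.state,
            "created_at": issue.created_at,
            "user": issue.user.login,
            "closed_at": issue.closed_at,
            # closed_by may require an additional API request per issue
            "closed_by": issue.closed_by.login if issue.closed_by else None,
        }
```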

issues.csv:

| column name | type | description | comment |
| --- | --- | --- | --- |
| github_user_cleaned_url | str | GitHub repository ID (user/repo) | might have a different name depending on the -n attribute passed to the script |
| state | str | one of open or closed | |
| created_at | datetime | issue creation time | |
| user | str | GitHub username of the issue author | |
| closed_at | datetime | issue closing time | None if it is still open |
| closed_by | str | GitHub username of the user who closed the issue | None if it is still open |