Methods: GitHub
A wide range of information is extracted using the GitHub API. Each crawling script expects a CSV file with a column containing the GitHub repository IDs; the name of that column is configurable.
Requests to the GitHub API are subject to rate limits, which reset every hour. Every function in the crawling scripts is therefore wrapped to catch rate limit exceptions and to wait until the reset time when the limit is exceeded. As a result, the scripts can run for a very long time.
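
The wrapping described above could look roughly like the following. This is a minimal sketch, not the scripts' actual code: `RateLimitExceeded`, its `reset_epoch` attribute and `wait_for_reset` are hypothetical names, standing in for whatever exception and reset information the GitHub client library in use provides.

```python
import functools
import time


class RateLimitExceeded(Exception):
    """Hypothetical stand-in for the client library's rate limit error."""

    def __init__(self, reset_epoch):
        super().__init__("rate limit exceeded")
        self.reset_epoch = reset_epoch  # Unix time at which the limit resets


def wait_for_reset(func, sleep=time.sleep, now=time.time):
    """Wrap func so that it is retried once the rate limit has reset."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        while True:
            try:
                return func(*args, **kwargs)
            except RateLimitExceeded as err:
                # Sleep until the reported reset time, plus a one-second margin.
                sleep(max(0.0, err.reset_epoch - now()) + 1.0)

    return wrapper
```

Because `sleep` and `now` have defaults, `wait_for_reset` can be applied directly as a decorator (`@wait_for_reset`); passing them explicitly makes the waiting behaviour testable without real delays.
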
Metadata is extracted directly from the GitHub API. The resulting CSV file will have a row for each repository.
`metadata.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`archived` | `bool` | whether the repository was archived or not | |
`created_at` | `datetime` | repository creation time | |
`has_wiki` | `bool` | whether the repository has a wiki | |
`has_pages` | `bool` | whether the repository has GitHub pages | |

Contents data is extracted in two ways.
First, we look for specific files, such as the license, contributing guidelines and README, and collect simple statistics about them using the GitHub API. We also examine the commit history of some of these files using `pydriller` to determine when they were added.
Second, we use `pydriller` to examine the commit history of the README file for changes to its headings.
Headings are recognised only by a leading `#`, so headings written in other formats are not detected.
Behind the scenes, `pydriller` temporarily clones or downloads (parts of) the repository, which can lead to long execution times and memory exceptions.
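
The heading rule described above can be illustrated with a small helper. This is a sketch under assumptions rather than the scripts' own implementation: `extract_headings` is a hypothetical name, and it assumes the README lines are already available as plain strings.

```python
def extract_headings(lines):
    """Return the text of every line that starts with '#', markers removed."""
    headings = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            # Drop the leading '#' characters and surrounding whitespace.
            headings.append(stripped.lstrip("#").strip())
    return headings
```

A rule like this misses underlined (setext) Markdown headings and reStructuredText headings, and it also picks up `#` comments inside code blocks, which is one way the results can fail to reflect the actual headings.
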
There are two resulting CSV files. `contents.csv` contains one row for each repository. `readme_history.csv` contains one row for each commit to a repository's README file that modified one or more headings in the README or added a citation to the README.
`contents.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`license` | `str` | license identifier as used by GitHub | can be `other` if GitHub doesn't recognise the license type, or `none` if no license file was found |
`readme_size` | `int` | README file size in bytes | |
`readme_path` | `str` | path to README file | usually `README.md`, but can also be `README.rst` |
`readme_emojis` | `int` | number of emojis used in the README file in its current state | |
`contributing_size` | `int` | `CONTRIBUTING.md` file size in bytes | `0` if non-existent |
`citation_added` | `datetime` | time at which a `CITATION.cff` file was added to the repository | does not consider variations of the filename |
`contributing_added` | `datetime` | time at which a `CONTRIBUTING.md` file was added to the repository | does not consider variations of the filename |

`readme_history.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`readme_path` | `str` | path to README file from `contents.csv` | |
`author_date` | `datetime` | time of commit | |
`added_headings` | `list[str]` | list of headings with commit additions | includes the full heading without leading `#` |
`deleted_headings` | `list[str]` | list of headings with commit deletions | includes the full heading without leading `#` |
`added_cites` | `list[str]` | any lines that were modified/added/deleted that contain citation indicators | indicators are `["DOI:", "doi.", "@article", "@misc"]` |

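
To make the `added_cites` column more concrete, the sketch below filters changed lines by the citation indicators listed in the table. The function name `find_cite_lines` is hypothetical, and plain case-sensitive substring matching is an assumption about how the indicators are applied.

```python
CITATION_INDICATORS = ["DOI:", "doi.", "@article", "@misc"]


def find_cite_lines(changed_lines, indicators=CITATION_INDICATORS):
    """Return the changed lines that contain at least one citation indicator."""
    return [
        line
        for line in changed_lines
        # Assumption: plain, case-sensitive substring matching.
        if any(indicator in line for indicator in indicators)
    ]
```

Under this assumption, a line such as `See https://doi.org/10.1234/abc` matches via `doi.`, whereas a lowercase `doi: 10.1234/abc` matches none of the four indicators.
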
The contribution statistics of each contributor are extracted for each repository. Because of the way the GitHub API reports contributions, this method has some limitations (for example, merge commits are not recorded).
The resulting file is `contributions.csv`, which contains one row for each commit author and week of repository life for each repository.

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`author` | `str` | GitHub ID of a contributor to this repository | |
`week_co` | `datetime` | Monday of a week in the repository's life | |
`commits` | `int` | number of commits made to the main branch of the repository by `author` in the week commencing on `week_co` | |

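
Since `contributions.csv` holds one row per author and week, a common first step is to sum the weekly counts. The following sketch shows one way to do that with the standard library. It assumes a comma-separated file with a header row; `total_commits` is a hypothetical helper and the sample rows are made up purely for illustration.

```python
import csv
import io
from collections import Counter


def total_commits(csv_file, repo_col="github_user_cleaned_url"):
    """Sum weekly commit counts per (repository, author) pair."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[(row[repo_col], row["author"])] += int(row["commits"])
    return totals


# Made-up sample rows in the column layout of contributions.csv.
sample = io.StringIO(
    "github_user_cleaned_url,author,week_co,commits\n"
    "octo/demo,alice,2024-01-01,3\n"
    "octo/demo,alice,2024-01-08,2\n"
    "octo/demo,bob,2024-01-01,1\n"
)
totals = total_commits(sample)
```

The `repo_col` parameter mirrors the configurable repository ID column name, so the helper still works if that column was renamed.
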
Forks and stars can be interpreted as markers of engagement with a repository, though how meaningful they are is debated. The two resulting files, `forks.csv` and `stars.csv`, contain one row for each time a repository was forked or starred, respectively. We did not collect data on whether a fork was active or had ever received any commits.
`forks.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`date` | `datetime` | time of fork event | |
`user` | `str` | GitHub username of the user who forked the repository | |

`stars.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`date` | `datetime` | time of star event | |
`user` | `str` | GitHub username of the user who starred the repository | |

We do not distinguish between pull requests and issues. The resulting file has a row for each issue in each repository.
`issues.csv`:

column name | type | description | comment |
---|---|---|---|
`github_user_cleaned_url` | `str` | GitHub repository ID (`user/repo`) | might have a different name depending on the `-n` attribute passed to the script |
`state` | `str` | one of `open` or `closed` | |
`created_at` | `datetime` | issue creation time | |
`user` | `str` | GitHub username of the issue author | |
`closed_at` | `datetime` | issue closing time | `None` if it is still open |
`closed_by` | `str` | GitHub username of the user who closed the issue | `None` if it is still open |
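
When working with `issues.csv`, the `None` values for open issues need to be handled before computing durations. The sketch below shows one way to derive the time an issue took to close. It assumes the timestamps are stored as ISO 8601 strings (for example `2024-01-01 10:00:00`) and that open issues appear as an empty field or the literal text `None`; `time_to_close` is a hypothetical helper.

```python
from datetime import datetime


def time_to_close(created_at, closed_at):
    """Return how long an issue was open, or None if it is still open."""
    # Assumption: open issues are stored as an empty field or the text "None".
    if closed_at in (None, "", "None"):
        return None
    return datetime.fromisoformat(closed_at) - datetime.fromisoformat(created_at)
```

The result is a `datetime.timedelta`, so `time_to_close(...).total_seconds()` gives a number that is easy to aggregate across issues.
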