Table of Contents
- Context and motivation
- Statistics
- Determining ASF repositories which use GitHub Actions (matrix.json)
- GitHub Actions Secrets
- Google Cloud Platform infrastructure
For the Apache Software Foundation (ASF), the limit on concurrent jobs in GitHub Actions (GA) is 180 (see GitHub's usage limits). GitHub does not provide statistics related to GA, so this repository was created to collect some basic data and make such analysis possible.
Statistics are gathered by the scheduled Fetch GitHub Action queue action. This action takes a series of "snapshots" of GA workflow runs for every ASF repository that uses GA (the list of those repositories is stored in matrix.json, described below).
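For illustration, the kind of per-repository request such a snapshot involves can be sketched with the GitHub REST API's "list workflow runs" endpoint. This is a hedged example, not the action's actual code, and it assumes the `requests` library plus a token in a `GITHUB_TOKEN` environment variable:

```python
# Sketch of a per-repository snapshot request (illustrative, not the
# action's actual code). Assumes `requests` and a GITHUB_TOKEN env var.
import os

import requests

API = "https://api.github.com/repos/{owner}/{repo}/actions/runs"

def fetch_runs(owner: str, repo: str, status: str) -> list:
    """Return workflow runs for one repository, filtered by status
    ("queued" or "in_progress")."""
    response = requests.get(
        API.format(owner=owner, repo=repo),
        params={"status": status},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        },
    )
    response.raise_for_status()
    return response.json()["workflow_runs"]

queued = fetch_runs("apache", "airflow", "queued")
in_progress = fetch_runs("apache", "airflow", "in_progress")
```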
The statistics consist of:
- JSON files: workflow runs for every repository, in separate files (described below)
- a CSV file: simple statistics in a single file (described below)

These files are uploaded as workflow artifacts.
The JSON files contain the list of a repository's workflow runs in the queued and in_progress states. Each file name contains the timestamp at which fetching of the list started. The JSON schema is described in the GitHub API documentation.
A single bq.csv file is created, containing simple statistics for all fetched repositories. This file is used by the Fetch GitHub Action queue action to efficiently upload data to the BigQuery table.
CSV file headers: `repository_owner`, `repository_name`, `queued`, `in_progress`, `timestamp`.
Example content:

```csv
repository_owner,repository_name,queued,in_progress,timestamp
apache,airflow,1,3,2020-11-19 17:53:24.139806+00:00
apache,beam,0,1,2020-11-19 17:53:39.171882+00:00
```
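For illustration only, a row in this format could be produced with the standard csv module. The helper name below is hypothetical and does not come from this repository:

```python
# Hypothetical sketch of writing one bq.csv statistics row; the function
# name is illustrative and not part of this repository's actual code.
import csv
from datetime import datetime, timezone

def append_stats_row(path, owner, repo, queued_count, in_progress_count):
    """Append one row in the bq.csv format shown above."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            owner,
            repo,
            queued_count,
            in_progress_count,
            # str() of an aware datetime renders like
            # 2020-11-19 17:53:24.139806+00:00
            datetime.now(timezone.utc),
        ])

append_stats_row("bq.csv", "apache", "airflow", 1, 3)
```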
The helper script scripts/parse_existing_json_files.py can be used to process existing JSON files into a single CSV. Example use:

```shell
gsutil -m cp -r gs://example-bucket-name/apache gcs
python parse_existing_json_files.py \
    --input-dir gcs \
    --output bq_csv.csv
bq load --autodetect \
    --source_format=CSV \
    dataset.table bq_csv.csv
```
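The same load can also be performed programmatically. A minimal sketch, assuming the google-cloud-bigquery client library and default application credentials:

```python
# Illustrative alternative to the `bq load` command above; assumes the
# google-cloud-bigquery library and credentials configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default project and credentials
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # mirrors the --autodetect flag
)
with open("bq_csv.csv", "rb") as f:
    load_job = client.load_table_from_file(f, "dataset.table", job_config=job_config)
load_job.result()  # block until the load job completes
```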
There is no single endpoint that returns the list of ASF repositories using GA, and since the ASF has 2000+ repositories, obtaining that list is not a trivial task.
The list of repositories which use GitHub Actions is stored in matrix.json and can be updated in three ways:
- by manually editing matrix.json and committing the changes
- by running the fetch_apache_projects_with_ga.py Python script and committing the changes (one possible detection check is sketched below)
- automatically, by the Fetch Apache Repositories with GA action (changes are committed automatically when they occur)

Running the Python script or the action issues many requests on behalf of the GitHub access tokens used, which may exceed their rate-limit quotas.
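For illustration, one plausible way to check whether a single repository uses GitHub Actions is the REST API's "list repository workflows" endpoint. This is a sketch under that assumption, not necessarily what fetch_apache_projects_with_ga.py actually does:

```python
# One plausible detection check (a sketch; the actual script may differ).
# Each call costs one API request, which is why quota matters for 2000+ repos.
import os

import requests

def uses_github_actions(owner: str, repo: str) -> bool:
    """Return True if the repository defines at least one GA workflow."""
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/workflows",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {os.environ['PERSONAL_ACCESS_TOKEN']}",
        },
    )
    response.raise_for_status()
    return response.json()["total_count"] > 0
```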
| Secret | Required | Description |
|---|---|---|
| `PERSONAL_ACCESS_TOKEN` | True | Personal GitHub access token used to authorize requests (no additional permissions are needed; no scopes have to be selected). It has a higher rate limit than the `GITHUB_TOKEN` secret. |
| `GCP_PROJECT_ID` | - | Google Cloud project ID. |
| `BQ_TABLE` | - | BigQuery table reference to which the simple statistics will be pushed (e.g. `dataset.table`). |
| `GCP_SA_KEY` | - | Google Cloud Service Account key (a Service Account with permissions to Google Cloud Storage and BigQuery). |
| `GCP_SA_EMAIL` | - | Google Cloud Service Account email (a Service Account with permissions to Google Cloud Storage and BigQuery). |
All infrastructure components necessary to store the statistics in BigQuery are defined in the ./terraform folder.