Table of Contents
- Context and motivation
- Statistics
- Determining ASF repositories which use GitHub Actions (matrix.json)
- GitHub Actions Secrets
- Google Cloud Platform infrastructure
For the Apache Software Foundation (ASF), the limit on concurrent jobs in GitHub Actions (GA) is 180 (see GitHub's usage limits). GitHub does not provide statistics related to GA, so this repository was created to collect some basic data and make such analysis possible.
Statistics are gathered by the scheduled Fetch GitHub Action queue action. This action takes a series of "snapshots" of GA workflow runs for every ASF repository that uses GA (the list of those repositories is stored in matrix.json, described below).
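For illustration, the kind of per-repository request such a snapshot involves can be sketched with the GitHub REST API's "list workflow runs" endpoint. This is a hedged example, not the action's actual code, and it assumes the `requests` library plus a token in a `GITHUB_TOKEN` environment variable:

```python
# Sketch of a per-repository snapshot request (illustrative, not the
# action's actual code). Assumes `requests` and a GITHUB_TOKEN env var.
import os

import requests

API = "https://api.github.com/repos/{owner}/{repo}/actions/runs"

def fetch_runs(owner: str, repo: str, status: str) -> list:
    """Return workflow runs for one repository, filtered by status
    ("queued" or "in_progress")."""
    response = requests.get(
        API.format(owner=owner, repo=repo),
        params={"status": status},
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        },
    )
    response.raise_for_status()
    return response.json()["workflow_runs"]

queued = fetch_runs("apache", "airflow", "queued")
in_progress = fetch_runs("apache", "airflow", "in_progress")
```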
The statistics consist of:
- JSON files: workflow runs for every repository, in separate files (described below)
- a CSV file: simple statistics in a single file (described below)

These files are uploaded as workflow artifacts.
The JSON files contain the list of a repository's workflow runs in the queued and in_progress states. Each file name contains the timestamp at which fetching of the list started. The JSON schema is described in the GitHub API documentation.
A single bq.csv file is created, containing simple statistics for all fetched repositories. This file is used by the Fetch GitHub Action queue action to efficiently upload data to the BigQuery table.
CSV file headers: `repository_owner`, `repository_name`, `queued`, `in_progress`, `timestamp`.
Example content:

```csv
repository_owner,repository_name,queued,in_progress,timestamp
apache,airflow,1,3,2020-11-19 17:53:24.139806+00:00
apache,beam,0,1,2020-11-19 17:53:39.171882+00:00
```
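For illustration only, a row in this format could be produced with the standard csv module. The helper name below is hypothetical and does not come from this repository:

```python
# Hypothetical sketch of writing one bq.csv statistics row; the function
# name is illustrative and not part of this repository's actual code.
import csv
from datetime import datetime, timezone

def append_stats_row(path, owner, repo, queued_count, in_progress_count):
    """Append one row in the bq.csv format shown above."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            owner,
            repo,
            queued_count,
            in_progress_count,
            # str() of an aware datetime renders like
            # 2020-11-19 17:53:24.139806+00:00
            datetime.now(timezone.utc),
        ])

append_stats_row("bq.csv", "apache", "airflow", 1, 3)
```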
The helper script scripts/parse_existing_json_files.py can be used to process existing JSON files into a single CSV. Example use:

```shell
gsutil -m cp -r gs://example-bucket-name/apache gcs
python parse_existing_json_files.py \
    --input-dir gcs \
    --output bq_csv.csv
bq load --autodetect \
    --source_format=CSV \
    dataset.table bq_csv.csv
```
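The same load can also be performed programmatically. A minimal sketch, assuming the google-cloud-bigquery client library and default application credentials:

```python
# Illustrative alternative to the `bq load` command above; assumes the
# google-cloud-bigquery library and credentials configured in the environment.
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default project and credentials
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # mirrors the --autodetect flag
)
with open("bq_csv.csv", "rb") as f:
    load_job = client.load_table_from_file(f, "dataset.table", job_config=job_config)
load_job.result()  # block until the load job completes
```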
There is no single endpoint that returns the list of ASF repositories using GA, and since the ASF has 2000+ repositories, obtaining that list is not a trivial task.
The list of repositories which use GitHub Actions is stored in matrix.json and can be updated in three ways:
- by manually editing matrix.json and committing the changes
- by running the fetch_apache_projects_with_ga.py Python script and committing the changes (one possible detection check is sketched below)
- automatically, by the Fetch Apache Repositories with GA action (changes are committed automatically when they occur)

Running the Python script or the action issues many requests on behalf of the GitHub access tokens used, which may exceed their rate-limit quotas.
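For illustration, one plausible way to check whether a single repository uses GitHub Actions is the REST API's "list repository workflows" endpoint. This is a sketch under that assumption, not necessarily what fetch_apache_projects_with_ga.py actually does:

```python
# One plausible detection check (a sketch; the actual script may differ).
# Each call costs one API request, which is why quota matters for 2000+ repos.
import os

import requests

def uses_github_actions(owner: str, repo: str) -> bool:
    """Return True if the repository defines at least one GA workflow."""
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/workflows",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {os.environ['PERSONAL_ACCESS_TOKEN']}",
        },
    )
    response.raise_for_status()
    return response.json()["total_count"] > 0
```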
| Secret | Required | Description |
|---|---|---|
| `PERSONAL_ACCESS_TOKEN` | True | Personal GitHub access token used to authorize requests (no additional permissions are needed; no scopes have to be selected). It has a higher rate limit than the `GITHUB_TOKEN` secret. |
| `GCP_PROJECT_ID` | - | Google Cloud project ID. |
| `BQ_TABLE` | - | BigQuery table reference to which the simple statistics will be pushed (e.g. `dataset.table`). |
| `GCP_SA_KEY` | - | Google Cloud Service Account key (a Service Account with permissions to Google Cloud Storage and BigQuery). |
| `GCP_SA_EMAIL` | - | Google Cloud Service Account email (a Service Account with permissions to Google Cloud Storage and BigQuery). |
All infrastructure components necessary to store the statistics in BigQuery are defined in the ./terraform folder.