A Python CLI to set up a UK Biobank (UKB) project folder.
Important: This CLI is only useful for UKB-approved KCL researchers and their collaborators, with an account on the Rosalind or CREATE HPC clusters.
Contents:
- Installation
- Use
2.1 Setup a project directory
2.2 Download UKB utilities
2.3 Include project data
2.4 Munge the UKB data
2.5 Add symlinks to sample information and relatedness files
- Access the data with ukbkings
- Additional withdrawals
- Updates to phenotype data
Clone the GitHub repo
git clone https://github.com/kenhanscombe/ukbproject.git
Change into the ukbproject directory, make munge.py executable, and copy the Snakemake SLURM profile (replace <username> with your KCL username).
cd ukbproject
chmod +x ukbproject/munge.py
mkdir -p /users/<username>/.config/snakemake
cp -R ukbproject/conf/slurm /users/<username>/.config/snakemake
KCL Rosalind users load the default python3 module
module avail python3
module load <python3_module>
KCL CREATE users load conda and python3
module spider conda
module load <conda_module>
module spider python
module load <python3_module>
You may be prompted to do a one-time conda init <SHELL_NAME>. For bash
conda init bash
Reload the terminal or run source ~/.bashrc. Create the conda environment, activate it, and install the ukbproject package into it.
conda env create -f ukbproject/conf/environment.yml
conda activate ukbproject
python3 -m pip install --editable .
After use (below), exit the environment with conda deactivate. To use the prj CLI on subsequent occasions, simply activate the environment again with conda activate ukbproject.
For help
prj --help
Usage: prj [OPTIONS] COMMAND [ARGS]...

  Sets up a UKB project on Rosalind/CREATE storing common data and utilities
  in the parent directory, at resources/ and bin/ respectively.

Options:
  --version   Show the version and exit.
  --hpc TEXT  Either "ROSALIND" (default) or "CREATE". Sets path to UKB data.
  --help      Show this message and exit.

Commands:
  clean     Removes defunct file/dir(s) from projects, and sets permissions.
  create    Creates a skeleton UKB project directory.
  link      Makes links to sample information and relatedness files.
  munge     Runs rules described in the Snakefile to munge UKB data.
  util      Downloads UKB file handlers and utilities.
  withdraw  Writes withdrawal IDs and corresponding indeces to be excluded.
Note. Usage is similar to git, with general options and commands that take further arguments. For help on a command (e.g. prj create)
prj create --help
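The git-like layout described above (general options first, then a command with its own arguments) can be sketched with the standard library's argparse. This is an illustration only, not the prj source: it reuses just the --hpc, -p and -n flags shown in this README, and the destination names project_id, project_dir and dry_run are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with one general option and two commands."""
    parser = argparse.ArgumentParser(prog="prj")
    # General option: applies to every command and comes before the command name.
    parser.add_argument("--hpc", default="ROSALIND", choices=["ROSALIND", "CREATE"])

    commands = parser.add_subparsers(dest="command", required=True)

    # Each command takes its own arguments (dest names are hypothetical).
    create = commands.add_parser("create")
    create.add_argument("-p", dest="project_id", required=True)

    munge = commands.add_parser("munge")
    munge.add_argument("-p", dest="project_dir", required=True)
    munge.add_argument("-n", dest="dry_run", action="store_true")
    return parser


# Hypothetical project id, for illustration only.
args = build_parser().parse_args(["--hpc", "CREATE", "create", "-p", "12345"])
print(args.hpc, args.command, args.project_id)  # CREATE create 12345
```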
At /scratch/datasets/ukbiobank, create a project directory ukb<project_id>.
prj create -p <project_id>
This will create the project directory structure in Figure 1, add symlinks to the genetic data in the project genotyped/ and imputed/ folders, and download the required UKB programs and utilities.
ukb<project_id>
├ genotyped
  ├ ukb_binary_v2.bed
  └ ukb_binary_v2.bim
├ imputed
  ├ ukb_sqc.txt
  ├ ukb_sqc_fields.txt
  ├ ukb_imp_chr*.bgen
  ├ ukb_imp_chr*.bgen.bgi
  └ ukb_mfi_chr*.txt
├ log
├ phenotypes
├ raw
├ returns
└ withdrawals
Figure 1 Project directory structure
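The top-level layout in Figure 1 can be reproduced in a few lines of Python. This is a minimal sketch of the folder structure only, not the implementation of prj create: the function name make_skeleton and the project id are hypothetical, the folders are written to a temporary directory, and the symlinks to the genetic data are left out.

```python
import tempfile
from pathlib import Path

# Top-level folders shown in Figure 1.
SUBDIRS = ["genotyped", "imputed", "log", "phenotypes", "raw", "returns", "withdrawals"]


def make_skeleton(parent: Path, project_id: str) -> Path:
    """Create ukb<project_id>/ under parent with the empty folders from Figure 1."""
    project = parent / f"ukb{project_id}"
    for name in SUBDIRS:
        (project / name).mkdir(parents=True, exist_ok=True)
    return project


if __name__ == "__main__":
    # Hypothetical project id, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        project = make_skeleton(Path(tmp), "12345")
        print(project.name, sorted(p.name for p in project.iterdir()))
```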
For most other operations, you should change into the project folder.
Add UKB file handlers and utilities to the bin/ and resources/ folders of the parent directory /scratch/datasets/ukbiobank with prj util. UKB data encodings (Codings_Showcase.csv, encoding.ukb) are downloaded to resources/; UKB programs are downloaded to bin/.
Project-specific encrypted files (*.enc), associated key files (*.key), and withdrawal files (w<project-id>_<yyyymmdd>.csv) must be downloaded and copied into the project subdirectory raw/. Change the key file names to match the encrypted files: ukb<project_id>.enc pairs with ukb<project_id>.key. The first line in each key file should be the project id; the second line should be the decryption key.
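Before munging, it can help to check that every encrypted file has a matching, correctly laid out key file. The sketch below is not part of the prj CLI; the function names are hypothetical, and it assumes the project id on the first line is purely numeric.

```python
import tempfile
from pathlib import Path
from typing import List


def check_key_file(key_path: Path) -> bool:
    """True if line 1 looks like a (numeric) project id and line 2 is non-empty."""
    lines = key_path.read_text().splitlines()
    return len(lines) >= 2 and lines[0].strip().isdigit() and lines[1].strip() != ""


def unpaired_enc_files(raw_dir: Path) -> List[str]:
    """List .enc files in raw_dir without a same-named, well-formed .key file."""
    problems = []
    for enc in sorted(raw_dir.glob("*.enc")):
        key = enc.with_suffix(".key")
        if not key.exists() or not check_key_file(key):
            problems.append(enc.name)
    return problems


if __name__ == "__main__":
    # Throwaway raw/ folder with made-up file names and key contents.
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp)
        (raw / "ukb1.enc").write_bytes(b"")
        (raw / "ukb1.key").write_text("12345\nnot-a-real-key\n")
        (raw / "ukb2.enc").write_bytes(b"")
        print(unpaired_enc_files(raw))  # ['ukb2.enc']
```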
Add to raw/ the project-specific key associated with the genetic data access, renamed to ukb<project_id>.key. Then download the project-specific genetic sample information files (.fam and .sample) and relatedness file (rel.dat/.txt) into raw/.
cd /scratch/datasets/ukbiobank/ukb<project_id>/raw/
/scratch/datasets/ukbiobank/bin/gfetch 22418 -c1 -m -a<key_name>.key
/scratch/datasets/ukbiobank/bin/gfetch 22828 -c1 -m -a<key_name>.key
/scratch/datasets/ukbiobank/bin/gfetch rel -a<key_name>.key
Process the encrypted UKB files into formats to be read by ukbkings.
prj munge -p ukb<project_id>
The munged phenotype data are written to phenotypes/ and output information is written to log/, for every <dataset_id> (or UKB basket) (Figure 2). For a dry run, in which no files are edited or written to disk and only details of what would be munged are printed to standard output, use prj munge -p ukb<project_id> -n.
ukb<project_id>
├ phenotypes
  ├ ukb<dataset_id>.csv
  ├ ukb<dataset_id>.html
  └ ukb<dataset_id>_field_finder.text
├ log
  └ ...
Figure 2 Munged phenotype data
Sample information files (.fam and .sample) and the relatedness file (rel.dat/.txt) should be in raw/. Create symlinks to these project-specific files in genotyped/ and imputed/ (Figure 3).
prj link \
-p ukb<project_id> \
-f <fam_file_name> \
-s <sample_file_name> \
-r <relatedness_file_name>
You can link one or more of these files (they do not all need to be passed to the program simultaneously).
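The linking step amounts to creating symlinks that point back at files in raw/. The sketch below shows the idea with os.symlink; it is not the prj link implementation. The helper name link_into is hypothetical, and unlike Figure 3 the link keeps the source file's name rather than being renamed.

```python
import os
import tempfile
from pathlib import Path


def link_into(src: Path, dest_dir: Path) -> Path:
    """Create (or refresh) a symlink in dest_dir that points at src."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    link = dest_dir / src.name
    # Replace a stale link so the call can be repeated safely.
    if link.is_symlink() or link.exists():
        link.unlink()
    os.symlink(src.resolve(), link)
    return link


if __name__ == "__main__":
    # Throwaway project folder with a made-up .fam file.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "raw").mkdir()
        fam = root / "raw" / "example.fam"
        fam.write_text("1000001 1000001 0 0 1 -9\n")
        link = link_into(fam, root / "genotyped")
        print(link.name, link.is_symlink())
```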
ukb<project_id>
├ genotyped
  └ ukb<project_id>_cal_chr1_v2_sN.fam
├ imputed
  ├ ukb<project_id>_imp_chr1_v3_sN.sample
  └ ukb<project_id>_rel_sN.dat
Figure 3 Project-specific sample information and relatedness symlinks.
N = number of samples with non-negative IDs
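To make the meaning of N concrete, the sketch below counts rows with a non-negative ID in .fam-style lines. It assumes whitespace-separated columns with the individual ID in the second column; the function name and the IDs are made up for illustration.

```python
def count_non_negative_ids(fam_lines):
    """Count rows whose individual ID (second column) is a non-negative integer."""
    n = 0
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and int(fields[1]) >= 0:
            n += 1
    return n


# Three made-up rows; the middle one has a negative ID and is not counted.
rows = [
    "1000001 1000001 0 0 1 -9",
    "-1 -1 0 0 2 -9",
    "1000003 1000003 0 0 2 -9",
]
print(count_non_negative_ids(rows))  # 2
```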
UKB genetic data resources:
- Accessing Genetic Data within UK Biobank
- Resource 531: Description of genetic data types
- Resource 664: Instructions for downloading genetic data using ukbgene
- Resource 667: UK Biobank Keyfile
The data should now be available from anywhere on Rosalind through the ukbkings R package. Read Access UKB data on Rosalind for a detailed description of usage. The same usage documentation is included in a package vignette. In R
devtools::install_github("kenhanscombe/ukbkings", dependencies = TRUE, force = TRUE)
vignette("Access UKB data on Rosalind")
Each time an updated set of participant withdrawals is received, add the w<project-id>_<yyyymmdd>.csv file to raw/.
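Because the date is part of the file name, the most recent withdrawal file in raw/ can be picked out from the names alone. This is a minimal sketch, assuming a numeric project id and a valid yyyymmdd date in each name; the function names and example file names are hypothetical.

```python
import re
from datetime import datetime

# w<project-id>_<yyyymmdd>.csv, assuming a numeric project id.
WITHDRAWAL_NAME = re.compile(r"^w(\d+)_(\d{8})\.csv$")


def parse_withdrawal_name(name):
    """Return (project_id, date) for a withdrawal file name, or None if it does not match."""
    match = WITHDRAWAL_NAME.match(name)
    if match is None:
        return None
    try:
        return match.group(1), datetime.strptime(match.group(2), "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None


def latest_withdrawal(names):
    """Return the withdrawal file name with the most recent date, or None."""
    dated = []
    for name in names:
        parsed = parse_withdrawal_name(name)
        if parsed is not None:
            dated.append((parsed[1], name))
    return max(dated)[1] if dated else None


files = ["w12345_20200204.csv", "w12345_20210809.csv", "notes.txt"]
print(latest_withdrawal(files))  # w12345_20210809.csv
```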
To exclude the latest withdrawals from the phenotype data, you have to generate your dataset with ukbkings::bio_phen again. Be aware that if any researcher on your project does run ukbkings::bio_phen again, to grab some other data say, this would apply the latest set of withdrawals.
To exclude the latest withdrawals from the genotype link files (.fam, .sample), there is a 2-step process: generate a list of withdrawals with prj withdraw, and then remove them from the link files with prj remove.
Note. In both cases the row count remains the same: ukbkings::bio_phen replaces phenotype data values with NA; prj remove replaces IDs with negative integers. This preserves the row count and alignment of files.
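The row-preserving idea can be illustrated with a short Python sketch. It is not the code behind prj withdraw or prj remove: the function names and IDs are made up, and the choice of distinct negative integers (-1, -2, ...) is an assumption, since the text above only says that IDs become negative.

```python
def withdrawn_indices(ids, withdrawn):
    """Positions (0-based) of IDs that appear in the withdrawal list."""
    withdrawn = set(withdrawn)
    return [i for i, sample_id in enumerate(ids) if sample_id in withdrawn]


def mask_withdrawn(ids, withdrawn):
    """Replace withdrawn IDs with distinct negative integers, keeping order and length."""
    withdrawn = set(withdrawn)
    masked = []
    next_negative = -1  # assumed scheme: -1, -2, ... in file order
    for sample_id in ids:
        if sample_id in withdrawn:
            masked.append(next_negative)
            next_negative -= 1
        else:
            masked.append(sample_id)
    return masked


# Made-up IDs; two of the four have withdrawn.
ids = [1000001, 1000002, 1000003, 1000004]
withdrawals = [1000002, 1000004]
print(withdrawn_indices(ids, withdrawals))  # [1, 3]
print(mask_withdrawn(ids, withdrawals))     # [1000001, -1, 1000003, -2]
```

Because the masked list has the same length and order as the input, row i still refers to the same row of the genetic data.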
If you receive any new data you would like to incorporate, place the new .enc and .key files (prepared as described in [Include project data]) into raw/ and re-run the data munging step (described in [Munge the UKB data]).