The Observed Antibody Space (OAS) is a collection of raw outputs from 58 Ig-seq experiments, covering over half a billion sequences and containing data from different species, disease states, age groups, and B cell types [Kovaltsuk et al., 2018].
The files from this study were downloaded from http://antibodymap.org/ (downloaded 2019.09.06).
This GitHub repository contains a Snakemake pipeline that goes from the original FASTA files to a final table containing the concatenated CDR1, CDR2 and CDR3 regions for each study. The metadata for each sample is extracted directly from the available JSON files into a separate table. The metadata table and the final data table can be joined on a unique identifier.
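As a rough illustration of this join, the sketch below matches rows of a data table to their metadata on a shared key, using only the Python standard library. The column names (`sample_id`, `chain`, `cdr123`, `species`) and the example values are hypothetical placeholders, not the names used by the pipeline; substitute the identifier column found in your own output tables.

```python
def join_on_id(data_rows, metadata_rows, key="sample_id"):
    """Left-join data rows with the metadata row that shares the same key.

    `key` is a hypothetical column name; use the identifier column
    that is actually present in both tables.
    """
    meta_by_id = {meta[key]: meta for meta in metadata_rows}
    joined = []
    for row in data_rows:
        # Rows without matching metadata are kept unchanged.
        meta = meta_by_id.get(row[key], {})
        joined.append({**meta, **row})
    return joined


# Made-up example rows; real tables would be read with csv.DictReader.
data_rows = [
    {"sample_id": "s1", "chain": "IGH", "cdr123": "GFTFSSYAISGSGGSTARDY"},
    {"sample_id": "s2", "chain": "IGK", "cdr123": "QSISSYAASQQSYSTPLT"},
]
metadata_rows = [
    {"sample_id": "s1", "species": "human"},
    {"sample_id": "s2", "species": "mouse"},
]

joined = join_on_id(data_rows, metadata_rows)
for row in joined:
    print(row["sample_id"], row["species"], row["chain"], row["cdr123"])
```

A left join is used here so that sequences without a metadata entry are not silently dropped; whether that is the desired behaviour depends on your analysis.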
The following files should be edited based on your needs:
- config.yaml: the config file contains all information about the studies to be fused together in the final table, input and output directories, and the project name.
- cluster.json: adjust this file to match your resource usage.
- Snakefile: contains all the steps in the pipeline. The MiXCR alignment is run here and can be adapted to your needs.
- run_smk.sh: can be used to fine-tune the settings for running the pipeline, e.g. how many times a step should be re-run if an error occurs. Additional information about the parameters for running this Snakemake pipeline is found in the file 'info_run_snakemake'.
The following command can be used to run the pipeline:
./run_smk.sh
The pipeline consists of the following steps:
- untar_fasta: takes the FASTA files, untars them, and saves them in the original directory
- mixcr_analyze: runs the alignment with MiXCR
- fuse_studies: takes all aligned files within a study and fuses them into one larger dataframe
- fuse_chains: fuses the three chains together and outputs a single table; the chain is recorded in a column
- fuse_all_tables: fuses the tables from the different studies and outputs the final table of this Snakemake pipeline
- extract_and_save_metadata: takes the gzipped JSON files containing the metadata and outputs a table with the metadata for every single sample
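For orientation, the metadata step could look roughly like the sketch below. It is not the code from the Snakefile: it assumes that the first line of each gzipped JSON file is a JSON object holding the sample metadata, and it derives a hypothetical `sample_id` from the file name. Both assumptions, the function names, and the demo values are illustrative and should be checked against the actual files.

```python
import csv
import gzip
import json
import tempfile
from pathlib import Path


def read_metadata(path):
    """Return the metadata of one sample as a dict.

    Assumption: the first line of the gzipped file is a JSON object
    with the sample metadata.
    """
    path = Path(path)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        record = json.loads(handle.readline())
    # Hypothetical identifier: the file name without ".json.gz".
    name = path.name
    suffix = ".json.gz"
    record["sample_id"] = name[: -len(suffix)] if name.endswith(suffix) else name
    return record


def write_metadata_table(paths, out_path):
    """Write one CSV row per sample and return the rows."""
    rows = [read_metadata(p) for p in sorted(paths)]
    # Samples may carry different keys, so use the union as the header.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Small demo with two made-up samples written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo = {
        "sampleA": {"Species": "human"},
        "sampleB": {"Species": "mouse", "Age": "adult"},
    }
    for name, meta in demo.items():
        with gzip.open(tmp / f"{name}.json.gz", "wt", encoding="utf-8") as handle:
            handle.write(json.dumps(meta) + "\n")
    rows = write_metadata_table(tmp.glob("*.json.gz"), tmp / "metadata.csv")
    with open(tmp / "metadata.csv", newline="", encoding="utf-8") as handle:
        table = list(csv.DictReader(handle))
```

Taking the union of all keys as the header keeps samples with missing fields in the table, with empty cells where a value is absent.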
For any remaining questions or inquiries, send an email to: [email protected]