ContaVect is a Python 3.10 object-oriented script developed to quantify and characterize DNA contaminants from gene therapy vector production after NGS sequencing. This automated pipeline can, however, be used for any wider purpose that requires mapping NGS datasets consisting of a mix of DNA sequences onto multiple references. It combines several features such as reference homology masking, fastq filtering and adapter trimming, short read alignment, SAM file splitting, and generation of human-readable output.
ContaVect is a Python pipeline composed of several modules linked together to analyze NGS data. Here is a description of the overall workflow:
- Each reference fasta file is parsed to identify all sequences within it and a Reference object is initialized to save the reference characteristics, the name and the output required.
- Optional: Homologies between references can be masked iteratively, starting with the last reference, which is masked by all the others, then the penultimate, which is masked by all the others except the last, and so forth until only one reference remains. This is done using blastn from the blast+ package.
- Optional: Fastq reads can be filtered by mean quality and adapters can be trimmed using a homemade, fully integrated, parallel fastq filtering module written in Python and C.
- If needed, an index for bwa is generated from the modified reference files, or from the original ones, after they have been merged together in a temporary directory. Fastq sequences are then aligned against the merged bwa reference index and a temporary SAM file is generated.
- Aligned reads from the SAM file are split and attributed to the Reference object for which a hit was found, or to one of the following garbage read categories: unmaped, lowMapq, secondary.
- Each reference then generates the output required in the configuration file (bam, sam, bedgraph, bed and covgraph).
- Finally, distribution reports and a log file are generated.
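The order of the optional homology masking step can be easier to follow as code. The sketch below only computes which references would mask which; it does not call blastn, and the function name `masking_plan` and the example reference names are hypothetical, not part of ContaVect.

```python
def masking_plan(references):
    """Return (subject, maskers) pairs in the masking order described above.

    The last reference is masked by all the others, the penultimate by all
    the others except the last, and so on. The first reference is never masked.
    """
    plan = []
    # Walk from the last reference down to the second one.
    for i in range(len(references) - 1, 0, -1):
        subject = references[i]
        maskers = references[:i]  # every reference listed before the subject
        plan.append((subject, maskers))
    return plan


if __name__ == "__main__":
    # Hypothetical reference names, in the order given in the configuration.
    for subject, maskers in masking_plan(["vector", "helper", "host"]):
        print(f"{subject} is masked by: {', '.join(maskers)}")
```

Under this reading, the order in which references are listed matters: references listed first keep their sequence intact, while later ones lose any region homologous to an earlier reference.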
For more information, comprehensive developer documentation can be generated from ContaVect.dox using Doxygen with doxypy.
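As an illustration of the read-splitting step of the workflow, here is a minimal sketch that sorts SAM records into references or garbage categories using only the standard SAM columns (FLAG, RNAME, MAPQ). It is not ContaVect's actual implementation: the function names, the `min_mapq` default of 30, the order of the checks, and the `seq_to_ref` mapping from sequence names to reference names are all assumptions made for the example.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4     # SAM flag bit: read is unmapped
SECONDARY_FLAG = 0x100  # SAM flag bit: secondary alignment


def classify_sam_line(line, min_mapq=30):
    """Return the sequence name hit by a SAM record, or a garbage category."""
    fields = line.rstrip("\n").split("\t")
    # SAM columns: 0=QNAME, 1=FLAG, 2=RNAME, 3=POS, 4=MAPQ
    flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
    if flag & UNMAPPED_FLAG or rname == "*":
        return "unmaped"
    if flag & SECONDARY_FLAG:
        return "secondary"
    if mapq < min_mapq:  # threshold is an assumed example value
        return "lowMapq"
    return rname


def count_reads(sam_lines, seq_to_ref, min_mapq=30):
    """Count reads per reference (via seq_to_ref) and per garbage category."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        label = classify_sam_line(line, min_mapq)
        # Sequence names are translated to their reference; categories pass through.
        counts[seq_to_ref.get(label, label)] += 1
    return counts
```

A reference fasta file may contain several sequences, which is why the sketch translates each sequence name into the name of the reference that owns it before counting.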
First of all, clone the repository:
$ git clone https://github.com/emlec/ContaVect/
You can then use Singularity or Conda to proceed.
Singularity (3.7.0 or above) must be installed on your system.
To build the singularity container, admin rights are needed. Use the following command:
sudo singularity build Contavect singularity/Singularity.rcp
To check dependencies versions in the container, use:
singularity run --app versions Contavect
To start a shell in the container, and use ContaVect, use:
singularity shell Contavect
Conda (4.10.1 or above) must be installed on your system. To build the conda environment use:
conda env create -f conda/Conda.yml
To activate it, and start ContaVect, use:
conda activate ContaVect
Prepare the configuration file to include your files and settings, as indicated in the template Conf.txt file provided with the source files.
Usage: ContaVect.py -c Conf.txt [-i -h]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i NUMBER_REF Generate an empty configuration file adapted to the number of
reference sequences to align the reads to and exit.
[Mandatory]
-c CONF_FILE Path to the configuration file [Mandatory]
An empty configuration file can be generated by running the program with the option -i NUMBER_REF. NUMBER_REF corresponds to the number of references used for the alignment.
All options are extensively described in the configuration file.
The test files from the dataset in the test folder can be used to check the ContaVect installation.
$ cd ./test/dataset/
ContaVect.py -c Conf_example_file.txt
The expected results for the test dataset are provided in the expected_result folder.
Export the path to the conf/matplotlibrc file as the MATPLOTLIBRC environment variable for headless cluster runs.
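When ContaVect is launched from another Python script, for instance a cluster job wrapper, the same variable can be set on the child process environment. The helper below is a hypothetical sketch, not part of ContaVect; it assumes the conf/matplotlibrc path is given relative to the current working directory.

```python
import os


def headless_env(rc_path="conf/matplotlibrc", base_env=None):
    """Return a copy of the environment with MATPLOTLIBRC set to rc_path."""
    env = dict(os.environ if base_env is None else base_env)
    # Matplotlib reads this variable to locate its configuration file.
    env["MATPLOTLIBRC"] = os.path.abspath(rc_path)
    return env


# Example use (commented out so that importing this file runs nothing):
# import subprocess
# subprocess.run(["ContaVect.py", "-c", "Conf.txt"], env=headless_env(), check=True)
```

Copying the environment instead of modifying `os.environ` in place keeps the setting local to the launched process.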
2 possibilities:
- Use ipython notebook with doc/Logbook.ipynb
- Consult directly online through nbviewer : Notebook
- Adrien Leger @a-slide
- Emilie Lecomte @emlec