hic-scaffolding is a pipeline for Hi-C scaffolding with Arima Genomics Hi-C libraries. It uses SALSA2 for the scaffolding step and is currently designed for the Compute Canada architecture.
This installation guide explains how to install the pipeline on a Compute Canada (Alliance) cluster.
First, either download and install Nextflow (>=21.04.3) or load the Nextflow module installed in the CC file system, e.g.:
module load StdEnv/2020 nextflow/21.04.3
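To confirm that the Nextflow on your PATH meets the minimum version above, a small check along these lines may help. This is only a sketch: `version_ge` and `check_nextflow` are hypothetical helper names, the comparison relies on GNU `sort -V`, and it assumes that `nextflow -v` prints a line containing a dotted version number.

```shell
#!/usr/bin/env bash
# Minimum Nextflow version stated in this guide.
MIN_NEXTFLOW_VERSION="21.04.3"

# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_nextflow: hypothetical helper. Assumes `nextflow -v` prints a line
# that contains a dotted version number such as 21.04.3.
check_nextflow() {
    if ! command -v nextflow >/dev/null 2>&1; then
        echo "nextflow not found on PATH" >&2
        return 1
    fi
    local found
    found="$(nextflow -v 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)"
    if [ -z "$found" ]; then
        echo "could not parse the nextflow version" >&2
        return 1
    fi
    if version_ge "$found" "$MIN_NEXTFLOW_VERSION"; then
        echo "nextflow $found is recent enough"
    else
        echo "nextflow $found is older than $MIN_NEXTFLOW_VERSION" >&2
        return 1
    fi
}
```

After loading the module, run check_nextflow in the same shell.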
To load mugqic modules, you may need to add the lines below to your ~/.bash_profile.
Open ~/.bash_profile in your favourite text editor (e.g. vi ~/.bash_profile) and add the following lines:
umask 0002
## GenPipes/MUGQIC genomes and modules
export MUGQIC_INSTALL_HOME=/cvmfs/soft.mugqic/CentOS6
module use $MUGQIC_INSTALL_HOME/modulefiles
Save and close the file, then run:
source ~/.bash_profile
Now try to load a mugqic module:
module load mugqic/python/3.8.5
If everything is set up correctly, the module loads without any errors.
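If you prefer to script the ~/.bash_profile edit instead of using a text editor, the sketch below appends each of the three settings only when it is not already present, so running it twice does not duplicate lines. The function names `add_line_once` and `setup_mugqic_profile` are hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# add_line_once FILE LINE: append LINE to FILE unless an identical line exists.
add_line_once() {
    local file="$1" line="$2"
    touch "$file"
    # -x matches whole lines only, -F treats LINE as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# setup_mugqic_profile [FILE]: add the settings from this guide to FILE
# (defaults to ~/.bash_profile). Hypothetical helper name.
setup_mugqic_profile() {
    local profile="${1:-$HOME/.bash_profile}"
    add_line_once "$profile" 'umask 0002'
    add_line_once "$profile" 'export MUGQIC_INSTALL_HOME=/cvmfs/soft.mugqic/CentOS6'
    # Single quotes keep $MUGQIC_INSTALL_HOME literal so it expands at login.
    add_line_once "$profile" 'module use $MUGQIC_INSTALL_HOME/modulefiles'
}
```

Call setup_mugqic_profile with no arguments to edit ~/.bash_profile, then run source ~/.bash_profile as described above.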
Nextflow usually requires an active terminal until it finishes all the steps in the pipeline, so you should use a terminal multiplexer such as tmux.
First open a new tmux session by typing tmux
or type tmux attach
to reattach to an already active tmux session. This way, you can view the outputs in the console. Then go to the working directory.
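The two tmux cases can be combined into one helper that reattaches to a named session if it exists and creates it otherwise. This is a minimal sketch: `attach_or_new` and the default session name `hic` are hypothetical choices. The tmux subcommands used (has-session, attach-session, new-session) are standard.

```shell
#!/usr/bin/env bash
# attach_or_new [NAME]: attach to tmux session NAME if it exists,
# otherwise start a new session with that name. NAME defaults to "hic",
# an arbitrary name chosen for this example.
attach_or_new() {
    local name="${1:-hic}"
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach-session -t "$name"
    else
        tmux new-session -s "$name"
    fi
}
```

Inside the session, you can detach with Ctrl-b d and leave the pipeline running.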
Then you may need to provide your RAP-ID through a new config file. Create a file named custom.config
(you can use any name you prefer), add the lines below, and replace YOUR-RAP-ID with your RAP-ID:
process {
clusterOptions = '--account=YOUR-RAP-ID'
}
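The config file above can also be generated from the shell. The sketch below writes the same process block with your RAP-ID filled in. `write_custom_config` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# write_custom_config RAP_ID [FILE]: write a Nextflow config that passes
# --account=RAP_ID to the scheduler, matching the snippet in this guide.
# FILE defaults to custom.config.
write_custom_config() {
    local rap_id="$1" out="${2:-custom.config}"
    if [ -z "$rap_id" ]; then
        echo "usage: write_custom_config RAP_ID [FILE]" >&2
        return 1
    fi
    cat > "$out" <<EOF
process {
    clusterOptions = '--account=${rap_id}'
}
EOF
}
```

For example, write_custom_config def-someuser creates custom.config in the current directory. Here def-someuser is only a placeholder account name.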
To run the pipeline, use the following command:
nextflow run pubudumanoj/hic-scaffolding -r main -resume --in_dir 'sorted/' -latest --fastq '*R{1,2}_001.fastq.gz' --REF '*.fasta' -profile cc_hpc -c custom.config
If you wish to run the pipeline in the background, add the -bg
option to the above command.
Use the in_dir param to specify the path of the directory containing the Hi-C fastq files that you want to use for scaffolding. Make sure to add a "/" at the end of the path.
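Because the trailing "/" is easy to forget, a wrapper script could normalise the path before passing it to --in_dir. A minimal sketch, with `with_trailing_slash` as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# with_trailing_slash PATH: print PATH with exactly one "/" appended
# when it does not already end in "/". Fails on an empty argument.
with_trailing_slash() {
    if [ -z "$1" ]; then
        echo "with_trailing_slash: empty path" >&2
        return 1
    fi
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}
```

For example, "$(with_trailing_slash sorted)" expands to sorted/ and can be passed to --in_dir.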
Currently, this pipeline only supports paired-end fastq files, and you must have both forward and reverse reads to use it. If you have fastq files from multiple samples, you can name them as follows, e.g.:
sample 1
HiC_Afraterculus_L001_R1_001.fastq.gz
HiC_Afraterculus_L001_R2_001.fastq.gz
sample 2
HiC_Afraterculus_L002_R1_001.fastq.gz
HiC_Afraterculus_L002_R2_001.fastq.gz
All sample names should share a common part, and each sample should be uniquely identified by a sample ID (L001 and L002 in the above example). The sample ID should be followed by the read type (R1 or R2), and the rest of the name should be identical.
After correctly formatting the fastq file names, set the fastq param so that it matches them, following the glob pattern below:
'*R{1,2}_001.fastq.gz'
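Before launching the pipeline, it can be useful to check that every forward read file has its reverse mate. The sketch below assumes the naming scheme from the example above (files ending in R1_001.fastq.gz and R2_001.fastq.gz). `check_pairs` is a hypothetical helper and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# check_pairs DIR: for every *R1_001.fastq.gz in DIR, require the matching
# *R2_001.fastq.gz. Succeeds only if at least one pair exists and no mate
# is missing.
check_pairs() {
    local dir="${1%/}" r1 r2 missing=0 found=0
    for r1 in "$dir"/*R1_001.fastq.gz; do
        [ -e "$r1" ] || continue   # the glob matched nothing
        found=$((found + 1))
        # Swap the read-type suffix to get the expected mate file name.
        r2="${r1%R1_001.fastq.gz}R2_001.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$found" -gt 0 ] && [ "$missing" -eq 0 ]
}
```

For example, check_pairs sorted/ && echo "all pairs present" reports success only when every R1 file in sorted/ has an R2 partner.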
Then specify the path to the contig assembly (reference fasta file) using the REF param. Also make sure to clean the scaffold names in the fasta file. If you use a simpler form, the output files will be smaller and easier to process, e.g.:
>scaffold_1
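One possible way to simplify the names is to cut each header at the first whitespace, so that only the identifier remains. This is a sketch of one cleaning strategy, not necessarily the one the pipeline authors use. `simplify_fasta_headers` is a hypothetical helper, and it assumes there is no space between ">" and the identifier.

```shell
#!/usr/bin/env bash
# simplify_fasta_headers: read FASTA on stdin and write it to stdout with
# each header line cut at the first space or tab, e.g.
# ">scaffold_1 len=8 cov=3" becomes ">scaffold_1". Sequence lines are
# passed through unchanged.
simplify_fasta_headers() {
    awk '/^>/ { print $1; next } { print }'
}
```

For example, simplify_fasta_headers < assembly.fasta > assembly.clean.fasta, where both file names are placeholders. Check afterwards that the shortened names are still unique.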
Optionally, you can modify any parameter defined in the config file. To do this, either create a nextflow.config
file in the working directory or pass the parameters as arguments to nextflow run
You can run the pipeline on your local machine, on a Compute Canada cluster (Narval, Beluga, Cedar and Graham are supported), or on Abacus, the McGill University Genome Center cluster.
To run on the local machine, use -profile local
or omit the -profile option.
To run on a Compute Canada cluster, use -profile cc_hpc
To run on Abacus, use -profile abacus
(however, the quast step has not been tested on Abacus and may fail).
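To avoid typos in the profile name, a launcher script could validate it against the three profiles listed above before assembling the command. The sketch below only prints the command rather than running it. `hic_command` is a hypothetical helper, and the parameter values are copied from the example command earlier in this guide.

```shell
#!/usr/bin/env bash
# hic_command [PROFILE] [CONFIG]: print the nextflow command from this guide
# for PROFILE (local, cc_hpc or abacus; defaults to local). When CONFIG is
# given, "-c CONFIG" is appended, e.g. for the custom.config with the RAP-ID.
hic_command() {
    local profile="${1:-local}" config="${2:-}"
    case "$profile" in
        local|cc_hpc|abacus) ;;
        *) echo "unknown profile: $profile" >&2; return 1 ;;
    esac
    local cmd="nextflow run pubudumanoj/hic-scaffolding -r main -resume --in_dir 'sorted/' -latest --fastq '*R{1,2}_001.fastq.gz' --REF '*.fasta' -profile $profile"
    if [ -n "$config" ]; then
        cmd="$cmd -c $config"
    fi
    printf '%s\n' "$cmd"
}
```

For example, hic_command cc_hpc custom.config prints the same command shown earlier, which you can inspect and then run.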
Special thanks to Dr. Rob Syme for continuous support and improvements.
Authors of the Tools used