This repo contains the scripts that can be used for Hi-C scaffolding with Arima Genomics Hi-C libraries

pubudumanoj/hic-scaffolding

hic-scaffolding

hic-scaffolding is a pipeline for Hi-C scaffolding with Arima Genomics Hi-C libraries. It uses SALSA2 for the scaffolding and is currently designed for the Compute Canada architecture.

Installation

This guide explains how to install the pipeline on a Compute Canada (Alliance) cluster.

First, either download and install Nextflow >= 21.04.3 or load the Nextflow module available on the CC file system, e.g.

module load StdEnv/2020 nextflow/21.04.3

To load MUGQIC modules, you may need to add the lines below to your ~/.bash_profile.

Open ~/.bash_profile in your favourite text editor, e.g. vi ~/.bash_profile

Add the following lines:

umask 0002
 
## GenPipes/MUGQIC genomes and modules
export MUGQIC_INSTALL_HOME=/cvmfs/soft.mugqic/CentOS6
module use $MUGQIC_INSTALL_HOME/modulefiles

Save and close the file, then run the following command:

source ~/.bash_profile

Now try to load a MUGQIC module, e.g. module load mugqic/python/3.8.5

If everything is set up correctly, the module loads without any error.
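The manual edit of ~/.bash_profile described above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; add_mugqic_setup is a hypothetical helper name, not something shipped with the pipeline. It appends the same lines only when they are not already present, so running it twice does not duplicate them.

```shell
# Hypothetical helper (not part of the pipeline): append the MUGQIC module
# setup to a profile file, unless the export line is already there.
add_mugqic_setup() {
  mugqic_profile="${1:-$HOME/.bash_profile}"

  # Skip when an earlier run or a manual edit already added the export line.
  if grep -q -s 'MUGQIC_INSTALL_HOME=' "$mugqic_profile"; then
    return 0
  fi

  # The quoted 'EOF' keeps $MUGQIC_INSTALL_HOME literal in the written file.
  cat >> "$mugqic_profile" <<'EOF'
umask 0002

## GenPipes/MUGQIC genomes and modules
export MUGQIC_INSTALL_HOME=/cvmfs/soft.mugqic/CentOS6
module use $MUGQIC_INSTALL_HOME/modulefiles
EOF
}
```

Calling add_mugqic_setup with no argument targets ~/.bash_profile; you still need to run source ~/.bash_profile afterwards.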

Usage

Nextflow usually needs an active terminal until every step in the pipeline has finished, so you should use a terminal multiplexer such as tmux.

First, open a tmux session by typing tmux, or type tmux attach (tmux a for short) to reattach to a session that is already running. This way, you can view the output in the console. Then go to the working directory.

You may then need to provide your RAP-ID in a new config file. Create a file named custom.config (you can use any name you prefer), add the lines below, and replace YOUR-RAP-ID with your own RAP-ID.

process {
  clusterOptions = '--account=YOUR-RAP-ID'
}
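If you prefer to generate this file from the command line, the sketch below writes the same three lines. It assumes a POSIX shell; write_custom_config is a hypothetical helper name, and the account value you pass is whatever RAP-ID your allocation uses.

```shell
# Hypothetical helper (not part of the pipeline): write a Nextflow config
# file that passes the given RAP-ID to the scheduler via clusterOptions.
write_custom_config() {
  cfg_rap_id="$1"
  cfg_file="${2:-custom.config}"

  # Refuse to write a config with an empty account name.
  if [ -z "$cfg_rap_id" ]; then
    echo "usage: write_custom_config RAP-ID [FILE]" >&2
    return 1
  fi

  # Unquoted EOF so that ${cfg_rap_id} is expanded into the file.
  cat > "$cfg_file" <<EOF
process {
  clusterOptions = '--account=${cfg_rap_id}'
}
EOF
}
```

For example, write_custom_config def-someuser creates custom.config in the current directory, where def-someuser is only a placeholder account name.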

To run the pipeline, use this command:

nextflow run pubudumanoj/hic-scaffolding -r main -resume --in_dir 'sorted/' -latest --fastq '*R{1,2}_001.fastq.gz' --REF '*.fasta' -profile cc_hpc -c custom.config

To run the pipeline in the background instead, add the -bg option to the command above.

Use the in_dir param to specify the directory that contains the Hi-C fastq files you want to use for scaffolding. Make sure the path ends with a "/".
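Because the trailing "/" is easy to forget, a small wrapper can add it before the value is passed to --in_dir. This is a minimal sketch, assuming a POSIX shell; normalize_in_dir is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the pipeline): print the given directory
# path with exactly one guaranteed trailing "/" appended when it is missing.
normalize_in_dir() {
  # An empty path would turn into "/", so reject it instead.
  if [ -z "$1" ]; then
    echo "normalize_in_dir: empty path" >&2
    return 1
  fi

  case "$1" in
    */) printf '%s\n' "$1" ;;   # already ends with a slash
    *)  printf '%s/\n' "$1" ;;  # append the missing slash
  esac
}
```

You could then write --in_dir "$(normalize_in_dir sorted)" in the nextflow run command.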

How to name the fastq files

Currently, this pipeline supports only paired-end fastq files, and you must have both the forward and the reverse reads to use it. If you have fastq files from multiple samples, you can name them as follows:

e.g
sample 1
HiC_Afraterculus_L001_R1_001.fastq.gz
HiC_Afraterculus_L001_R2_001.fastq.gz

sample 2
HiC_Afraterculus_L002_R1_001.fastq.gz
HiC_Afraterculus_L002_R2_001.fastq.gz

All sample names should share a common part, and each sample should be uniquely identified by a sample ID (L001 and L002 in the example above). The sample ID should be followed by the read type (R1 or R2), and the rest of the name should be identical across files.

After formatting the fastq file names correctly, set the fastq param to a glob pattern that matches them, such as the one below
'*R{1,2}_001.fastq.gz'
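Before launching the pipeline, it can help to confirm that every forward read file has its reverse mate and vice versa. The sketch below is one way to check this, assuming a POSIX shell and the _R1_001.fastq.gz / _R2_001.fastq.gz suffixes from the example above; check_fastq_pairs is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the pipeline): verify that each
# *R1_001.fastq.gz file in a directory has a matching *R2_001.fastq.gz
# file, and the other way round. Returns non-zero on any problem.
check_fastq_pairs() {
  fq_dir="${1%/}"   # drop one trailing slash so paths are not doubled
  fq_status=0
  fq_found=0

  for fq_r1 in "$fq_dir"/*R1_001.fastq.gz; do
    [ -e "$fq_r1" ] || continue   # glob matched nothing
    fq_found=1
    fq_r2="${fq_r1%R1_001.fastq.gz}R2_001.fastq.gz"
    if [ ! -e "$fq_r2" ]; then
      echo "missing reverse reads for: $fq_r1" >&2
      fq_status=1
    fi
  done

  for fq_r2 in "$fq_dir"/*R2_001.fastq.gz; do
    [ -e "$fq_r2" ] || continue
    fq_found=1
    fq_r1="${fq_r2%R2_001.fastq.gz}R1_001.fastq.gz"
    if [ ! -e "$fq_r1" ]; then
      echo "missing forward reads for: $fq_r2" >&2
      fq_status=1
    fi
  done

  if [ "$fq_found" -eq 0 ]; then
    echo "no *R{1,2}_001.fastq.gz files found in: $fq_dir" >&2
    fq_status=1
  fi

  return "$fq_status"
}
```

For example, check_fastq_pairs sorted/ prints any unpaired file to standard error and returns a non-zero status if something is missing.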

Then specify the path to the contig assembly (reference fasta file) with the REF param. Also make sure to clean the scaffold names in the fasta file: with simpler names, e.g. >scaffold_1, the output files are smaller and easier to process.
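One possible way to simplify the names is to renumber the fasta headers sequentially. The sketch below does this with awk, assuming a POSIX shell; simplify_fasta_headers and the optional mapping file are hypothetical, not part of the pipeline. The original header text is replaced, so the mapping file is the only record of the old names.

```shell
# Hypothetical helper (not part of the pipeline): rewrite every fasta header
# as >scaffold_N (N = 1, 2, ...) and print the result to standard output.
# If a second argument is given, a tab-separated table of new and original
# names is written to that file.
simplify_fasta_headers() {
  awk -v map="${2:-}" '
    /^>/ {
      n++
      # Record "new name <TAB> original header" when a map file was requested.
      if (map != "") print "scaffold_" n "\t" substr($0, 2) > map
      print ">scaffold_" n
      next
    }
    { print }   # sequence lines pass through unchanged
  ' "$1"
}
```

For example, simplify_fasta_headers contigs.fasta names.tsv > contigs.clean.fasta, where all three file names are placeholders.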

Optionally, you can modify any parameter defined in the config file. To do this, either create a nextflow.config file in the working directory or pass the parameters as arguments to nextflow run.

Running the pipeline in different infrastructures

You can run the pipeline on your local machine, on a Compute Canada cluster (Narval, Beluga, Cedar and Graham are supported) or on the McGill University Genome Center cluster, Abacus. To run on the local machine, use -profile local or omit the -profile option.

To run on a Compute Canada cluster, use -profile cc_hpc

To run on Abacus, use -profile abacus (however, the quast step has not been tested on Abacus and may fail)

Acknowledgement

Special thanks to Dr. Rob Syme for continuous support and improvements
Authors of the tools used