I. Introduction
II. Prerequisites
III. Download and Install
IV. Tutorial
4.1 Common Rules
4.2 Example
V. Get Help
C & S's Toolkit (CSTK) is a package composed of Perl, R, Python and shell scripts. Using a single script from the package, or combining several scripts into a pipeline, can meet general bioinformatics analysis requirements and complete miscellaneous tasks. The function modules of CSTK include, but are not limited to:
A. File format conversion
B. Processing of standard format files
C. Table manipulation
D. Statistical testing
E. Survival analysis
F. Data visualization
G. Parsing of third-party tool results
H. Gene/region expression quantification
I. Alternative splicing identification and quantification
CSTK mainly requires the following software/packages:
Software/Package | Version | Description |
---|---|---|
Perl | >=5.010 | The Perl language program. |
BioPerl | Not limited | Needed only by fqConverter.pl and mafLiftOver.pl in current version. |
R | >=3.3.2 | The R language program, mainly used for statistical analysis and data visualization. |
ggplot2 | 2.x | An R package for data visualization. |
Python | =2.7 | The Python language program. |
SAMTools | >=1.5 | The toolkit to manipulate BAM files. |
The versions given are suggestions rather than strict requirements; older versions may also work, although we haven't tested them. Additional required software/packages are script-specific: if a specific script in CSTK requires extra software/packages, please install them according to the error message.
Please download the CSTK from the release page or clone it with git.
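For example, to clone into the directory assumed in the next steps (replace <repository-URL> with the address shown on the repository/release page):
git clone <repository-URL> /home/<yourUserName>/bin/CSTK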
CSTK is extremely easy to install because all the source code is written in scripting languages, so no compilation is needed. After installing the software/packages required in the "Prerequisites" section, CSTK is ready to be used.
Before using CSTK, first add the path of CSTK to the environment variable $PATH, so that the scripts in CSTK can be used directly at the command line. Assuming CSTK is decompressed into /home/<yourUserName>/bin/CSTK, run the following commands:
PATH=/home/<yourUserName>/bin/CSTK:$PATH
chmod u+x /home/<yourUserName>/bin/CSTK/*.{pl,R,py,sh}
You can add the first command line to your environment configuration file (in general, /home/<yourUserName>/.bashrc), so that CSTK is available every time you log in:
cat <<'EOF' >>/home/<yourUserName>/.bashrc
PATH=/home/<yourUserName>/bin/CSTK:$PATH
EOF
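To make the change take effect in the current shell and to check that the scripts are found, you can, for example, run:
source /home/<yourUserName>/.bashrc
fqConverter.pl -h   # should print the help information of the script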
Rather than downloading and installing manually, it's recommended to set CSTK up with Docker (https://hub.docker.com/r/zhangsjsky/cstk) or Singularity (https://cloud.sylabs.io/library/_container/5ed705dfd0ff9c878fea7e50). In this way you don't need to install the dependencies, as they are already installed in the container.
# For Docker
docker pull zhangsjsky/cstk:latest
# For Singularity
singularity pull library://zhangsjsky/default/cstk
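A minimal sketch of using a CSTK script through the container, assuming the scripts are on $PATH inside the image (check the image documentation for its exact layout):
# Docker: pipe data through a script inside the container
zcat myReads.illumina.fq.gz | docker run --rm -i zhangsjsky/cstk:latest fqConverter.pl >myReads.sanger.fq
# Singularity: the pull command above saves the image as cstk_latest.sif by default
singularity exec cstk_latest.sif fqConverter.pl -h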
The following rules are commonly applied in CSTK:
A) For a script with only one input file, the input is read from STDIN (standard input) unless an input file is specified as an argument. For a script with only one output file, the output is printed to STDOUT (standard output), while STDERR (standard error) may be used for log output. For a script with two output files, the outputs may be printed to STDOUT and STDERR respectively, or to STDOUT and a file specified by an option. Preferring STDOUT and STDERR over options makes it convenient to chain analysis steps into a pipeline. Pipelines also speed up the analysis and avoid writing intermediate files to disk.
B) Help information for almost all scripts can be viewed with the -h, -help or --help option.
C) For options with a value, the option and value of Perl and shell scripts are separated by a space character (e.g. -option value), while those of R scripts are separated by an equals sign (e.g. -option=value).
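For example, the same kind of call looks like this for a Perl script and an R script (both are introduced in the Example section; file names here are illustrative):
# Perl/shell script: space between option and value
gpe2bed.pl -t 6 INPUT.gpe >OUTPUT.bed
# R script: equals sign between option and value
bar.R -p=figure.pdf <input.tsv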
Only one or two scripts from each function module of CSTK are illustrated in the following examples; illustrations for more scripts will be added according to user feedback.
- fqConverter.pl
fqConverter.pl -h
Usage: perl fqConverter.pl input.fq >output.fq
If input.fq isn't specified, input from STDIN
If output.fq isn't specified, output to STDOUT
Option:
-s --source STR The quality format of your source fastq file ([fastq-illumina], fastq-solexa, fastq)
-t --target STR The quality format of your target fastq file (fastq-illumina, fastq-solexa, [fastq])
-h --help This help information screen
Since FASTQ is a common format in bioinformatics, the content of the input file is not illustrated here. For details of the FASTQ format, please see the Wikipedia page.
Running example:
zcat myReads.illumina.fq.gz | fqConverter.pl >myReads.sanger.fq
The command converts the FASTQ file from the Illumina quality encoding (default of the -s option) to the standard Sanger encoding (default of the -t option).
A pipe is used in the command to illustrate the advantage of chaining CSTK scripts into a pipeline. You can further improve it as:
zcat myReads.illumina.fq.gz | fqConverter.pl | gzip -c >myReads.sanger.fq.gz
In this way, the output FASTQ is stored as a compressed gz file. In this procedure, the CSTK script acts only as an adapter in the pipeline: it takes its input from the output of the previous step and prints its output for the next step to consume.
- gpe2bed.pl
gpe2bed.pl -h
Usage: perl gpe2bed.pl INPUT.gpe >OUTPUT.bed
If INPUT.gpe isn't specified, input from STDIN
Output to STDOUT
Option:
-b --bin With bin column
-t --bedType INT Bed type. It can be 3, 6, 9 or 12[12]
-i --itemRgb STR RGB color[0,0,0]
-g --gene Output 'gene name' in INPUT.gpe as bed plus column
-p --plus Output bed plus when there are additional columns in gpe
-h --help Print this help information
Assuming there is an input file (example.gpe) with the following contents:
76 | NM_015113 | chr17 | - | 3907738 | 4046253 | 3910183 | 4046189 | 55 | 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, | 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, | 0 | ZZEF1 | cmpl | cmpl | 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0, |
147 | NM_001308237 | chr1 | - | 78028100 | 78149112 | 78031324 | 78105156 | 14 | 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78105133,78107068,78107206,78148946, | 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78105287,78107131,78107340,78149112, | 0 | ZZZ3 | cmpl | cmpl | 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1, |
147 | NM_015534 | chr1 | - | 78028100 | 78148343 | 78031324 | 78099039 | 15 | 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269, | 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343, | 0 | ZZZ3 | cmpl | cmpl | 2,0,0,0,0,0,0,1,0,2,0,-1,-1,-1,-1, |
As shown, there is a bin column (the first column) in the gpe file, so the -b option should be specified. For a description of the gpe format, please visit UCSC: http://genome.ucsc.edu/FAQ/FAQformat.html#format9.
Run the script:
gpe2bed.pl -b example.gpe >example.bed
The content of the output bed file:
chr17 | 3907738 | 4046253 | NM_015113 | 0 | - | 3910183 | 4046189 | 0,0,0 | 55 | 2526,72,154,166,99,169,144,360,136,101,192,120,200,133,175,278,140,151,152,263,166,139,130,162,120,165,469,94,80,241,149,202,82,167,123,122,160,115,68,134,170,101,216,112,149,92,99,179,117,211,200,172,195,145,418 | 0,4438,5159,9004,9645,9902,11878,12926,13391,15224,16684,18264,20474,27681,28383,29570,37984,39779,45263,46336,47526,49612,51771,53549,54726,58308,59916,62002,62718,66239,68163,69705,70652,70818,72192,72423,73438,76931,77992,81225,82041,82989,84233,86274,91386,92164,97872,100188,101248,105208,108164,109854,112527,119462,138097 |
chr1 | 78028100 | 78149112 | NM_001308237 | 0 | - | 78031324 | 78105156 | 0,0,0 | 14 | 3369,101,135,153,96,102,72,116,148,139,154,63,134,166 | 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,77033,78968,79106,120846 |
chr1 | 78028100 | 78148343 | NM_015534 | 0 | - | 78031324 | 78099039 | 0,0,0 | 15 | 3369,101,135,153,96,102,72,116,148,139,1556,154,63,134,74 | 0,3665,5916,13652,16358,17111,18582,19360,19563,22101,69434,77033,78968,79106,120169 |
The -t option is 12 by default, so the output bed file is in bed12 format by default. You can also try the -g or -p options to modify or add columns in the output.
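For instance, to get a bed6 output with the gene name appended as a plus column (the output file name is illustrative):
gpe2bed.pl -b -t 6 -g example.gpe >example.bed6+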
- gpeMerge.pl
gpeMerge.pl -h
Usage: perl gpeMerge.pl input.gpe >output.gpe
If input.gpe not specified, input from STDIN
Output to STDOUT
-b --bin With bin column
-l --locus Merge with locus (default: merge gene)
-t --longTranscript Overlap is against long transcript (default against short transcript)
-n --name Set the "gene name" column as "transcript name(s)" when the corresponding gene name unavailable
-p --percent DOU Minimal overlap percent to merge transcript (default: 0)
-h --help Print this help information
This script merges different transcripts of the same gene (or of the same locus if the -l option is specified). The merging criterion is: for each site of a gene, if the site is located in an exon of any transcript, it is treated as an exonic site in the merged result; otherwise it is treated as an intronic site.
A diagram (not reproduced here) intuitively illustrates the merging: the top and middle lines are the two transcripts of the same gene, and the bottom line is the result after merging.
Use the example.gpe file in the previous section as input to run this script:
gpeMerge.pl -b example.gpe >merged.gpe
The content of the output:
6 | NM_015113 | chr17 | - | 3907738 | 4046253 | 3910183 | 4046189 | 55 | 3907738,3912176,3912897,3916742,3917383,3917640,3919616,3920664,3921129,3922962,3924422,3926002,3928212,3935419,3936121,3937308,3945722,3947517,3953001,3954074,3955264,3957350,3959509,3961287,3962464,3966046,3967654,3969740,3970456,3973977,3975901,3977443,3978390,3978556,3979930,3980161,3981176,3984669,3985730,3988963,3989779,3990727,3991971,3994012,3999124,3999902,4005610,4007926,4008986,4012946,4015902,4017592,4020265,4027200,4045835, | 3910264,3912248,3913051,3916908,3917482,3917809,3919760,3921024,3921265,3923063,3924614,3926122,3928412,3935552,3936296,3937586,3945862,3947668,3953153,3954337,3955430,3957489,3959639,3961449,3962584,3966211,3968123,3969834,3970536,3974218,3976050,3977645,3978472,3978723,3980053,3980283,3981336,3984784,3985798,3989097,3989949,3990828,3992187,3994124,3999273,3999994,4005709,4008105,4009103,4013157,4016102,4017764,4020460,4027345,4046253, | 0 | ZZEF1 | cmpl | cmpl | 0,0,2,1,1,0,0,0,2,0,0,0,1,0,2,0,1,0,1,2,1,0,2,2,2,2,1,0,1,0,1,0,2,0,0,1,0,2,0,1,2,0,0,2,0,1,1,2,2,1,2,1,1,0,0, |
147 | NM_001308237,NM_015534 | chr1 | - | 78028100 | 78149112 | 78028100 | 78149112 | 16 | 78028100,78031765,78034016,78041752,78044458,78045211,78046682,78047460,78047663,78050201,78097534,78105133,78107068,78107206,78148269,78148946 | 78031469,78031866,78034151,78041905,78044554,78045313,78046754,78047576,78047811,78050340,78099090,78105287,78107131,78107340,78148343,78149112 | 0 | ZZZ3 | unk | unk | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, |
As you can see, the two transcript records of the ZZZ3 gene have been merged into one record. The exon starts and ends are updated to the merged coordinates, and the other coordinate-associated information is updated accordingly. The transcript name column is also updated to a comma-separated transcript list.
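To merge transcripts by locus rather than by gene name, add the -l option (the output file name is illustrative):
gpeMerge.pl -b -l example.gpe >mergedByLocus.gpe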
- gpeFeature.pl
gpeFeature.pl -h
Usage: perl gpeFeature.pl OPTION INPUT.gpe >OUTPUT.bed
If INPUT.gpe isn't specified, input from STDIN
Example: perl gpeFeature.pl -b -g hg19.size --upstream 1000 hg19.refGene.gpe >hg19.refGene.bed
Option:
-b --bin With bin column
-i --intron Fetch introns in each transcript
-e --exon Fetch exons in each transcript
-c --cds Fetch CDS in each transcript
-u --utr Fetch UTRs in each transcript, 5'UTR then 3'UTR (or 3' first)
-p --prime INT 5 for 5'UTR, 3 for 3'UTR(force -u)
--complete Only fetch UTR for completed transcripts
--upstream INT Fetch upstream INT intergenic regions(force -g)
--downstream INT Fetch downstream INT intergenic regions(force -g)
-g --chrSize FILE Tab-separated file with two columns: chr name and its length
-s --single Bundle all features into single line for each transcript
--addIndex Add exon/intron/CDS/UTR index as suffix of name in the 4th column
-h --help Print this help information
This script is used to extract specific features from a gpe file and output them in bed format.
For example, to extract exons:
gpeFeature.pl -b example.gpe -e >exon.bed6+
chr17 | 3907738 | 3910264 | NM_015113 | 0 | - | NM_015113 | 2526 | ZZEF1 |
chr17 | 3912176 | 3912248 | NM_015113 | 0 | - | NM_015113 | 72 | ZZEF1 |
chr17 | 3912897 | 3913051 | NM_015113 | 0 | - | NM_015113 | 154 | ZZEF1 |
More records of the output are omitted. The result is in bed6+ format, with each line representing an exon. The last two columns give the exon length and the gene name.
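Other features are extracted in the same way; for example, introns or 5'UTRs (the output file names are illustrative):
gpeFeature.pl -b -i example.gpe >intron.bed6+
gpeFeature.pl -b -p 5 example.gpe >utr5.bed6+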
- tsvFilter.pl
tsvFilter.pl -h
Usage: perl tsvFilter.pl -o originFile.tsv -1 1,4 -m i|include targetFile.tsv >filtered.tsv
If targetFile.tsv isn't specified, input is from STDIN
Output to STDOUT
Option:
-o --originFile TSV The original file containing fields (specified by --originFields) used to include or exclude lines in targetFile.tab
-1 --originFields STR Comma-separated field list specifying which fileds in the originFile.tab to be used to include or exclude, 1-based start [1]
The element of the list can be a single column number or a range with nonnumeric char as separator
To specify the last column left the range right margin blank
If continuous range specified like '1-3-6', the first range '1-3' will be output
e.g.:
-1 1,4 output columns 1,4
-1 1-4,6..8 output columns 1,2,3,4,6,7,8
-1 1,4,6- output columns 1,4,6,7,... last column
-1 1-3-6 output columns 1,2,3
-2 --targetFields STR Comma-separated field list specifying which fileds in the targetFile.tab are used to include or exclude lines, 1-based start [1]
More description about --targetFields, see --originFields
-m --mode STR To include or exclude lines in targetFile.tab, it can be i|include or e|exclude[e]
-s --separator STR (Optional)A separator to join the fields specified, if necessary[Empty string]
-h --help Print this help information
This script is often used in table manipulation. Its function is to filter a table (i.e. the target table) according to one or more of its columns (i.e. the target columns), using one or more columns (i.e. the source columns) of another table (i.e. the source table). The filtering mode determines whether lines whose target columns match the source columns are included or excluded.
Assuming the first column of the source table (source.tsv) is the gene name and the second column is an up-/down-regulation mark:
Gene1 | Up |
Gene2 | Down |
Gene3 | Up |
The first column of the target table (target.tsv) is the gene name and the second column is the up- or down-regulation fold-change:
Gene1 | 5 |
Gene2 | 10 |
Gene3 | 2 |
If you want to know the fold-change of the up-regulated genes, run the following command:
tsvFilter.pl -o <(awk '$2=="Up"' source.tsv) -m i target.tsv
The output is:
Gene1 | 5 |
Gene3 | 2 |
The command first fetches the up-regulated genes with awk and then feeds them to the -o option to filter the target.tsv file. The -m option specifies the filtering mode as "include". Because both the source column and the target column are the first column, the -1 and -2 options keep their default value of 1.
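Conversely, to exclude the up-regulated genes instead, switch the mode to "e" (the output file name is illustrative):
tsvFilter.pl -o <(awk '$2=="Up"' source.tsv) -m e target.tsv >notUp.tsv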
- tsvJoin.sh
tsvJoin.sh
Usage: tsvJoin.sh OPTIONS [-1 field1][-2 fields2] input1.tsv [input2.tsv]
Note: input2.tsv can be omitted or be a "-" to input the data from STDIN.
when input1.tsv is a "-" (i.e. from STDIN), input2.tsv must be specified.
Options:
-1|field1 STR The field in input1.tsv used when joining[1]
Refer to the -1 option of linux join command
-2 field2 STR The field in input2.tsv used when joining[1]
-i|inputDelimiter STR The delimiter of your input file[\t]
-j|joinDelimiter STR The delimiter used by linux join commond (i.e. the -t option of join command)[|]
-o|outputDelimiter STR The delimiter of the output[\t]
-a|unpairedFile INT Also print unpairable lines from file INT (1, 2)
Please specify at least one file
This script is also often used in table manipulation. Its function is to join two tables according to whether the values of one or more columns are the same. Unlike the join command built into Linux, this script can use the tab character as the field separator.
Take the previous source.tsv and target.tsv as examples, join them according to gene name:
tsvJoin.sh source.tsv target.tsv >join.tsv 2>/dev/null
The output join.tsv is:
Gene1 | Up | 5 |
Gene2 | Down | 10 |
Gene3 | Up | 2 |
In the command, '2>/dev/null' discards the STDERR (which carries running information). Because the columns used for joining are the first column in both input tables, the -1 and -2 options keep their default value of 1.
In the above output, the column used for joining (i.e. the gene name column) is placed as the first column, followed by the remaining columns of the first and the second files.
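To also keep the lines of source.tsv that have no match in target.tsv, add the -a option described in the help (the output file name is illustrative):
tsvJoin.sh -a 1 source.tsv target.tsv >joinAll.tsv 2>/dev/null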
- testT.R
testT.R -h
Usage: testT.R -option=value <input.lst|<input1.lst input2.lst|input1.lst input2.lst >pValue
Option:
-p|pair Pair
-a|alt STR The alternative hypothesis: [two.sided], greatr or less
-h Show help
This script is used to conduct a t-test. '<input.lst|<input1.lst input2.lst|input1.lst input2.lst' in the usage means there are several input manners:
A) Input one file from STDIN;
B) Input from STDIN and file (specified by an argument), respectively;
C) Input from two files (specified by two arguments), respectively.
The commonly used manner is the last one. Assuming the content of the first file (value1.lst) is:
1
2
3
4
5
And that of the second file (value2.lst) is:
3
4
5
6
7
8
testT.R value1.lst value2.lst
0.0398030820241363
The output p-value (< 0.05) indicates a significant difference between the two lists of values.
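A one-sided test can be requested with the -a option; assuming it follows the semantics of R's t.test, the following asks whether the values in the first list tend to be smaller than those in the second:
testT.R -a=less value1.lst value2.lst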
- testWilcoxon.R
testWilcoxon.R -h
Usage: testWilcoxon.R -option=value <input.lst|input.lst|input1.lst input2.lst >pValue
Options
-a|alt STR The alternative hypothesis ([two.sided], greater, less)
You can specify just the initial letter
-m|mu DOU A parameter used to form the null hypothesis for one-sample test[0]
-h|help Show help
testWilcoxon.R value1.lst value2.lst 2>/dev/null
0.06601543
The output p-value (> 0.05) indicates there isn't a significant difference between the two lists of values; this conclusion differs from that of the t-test.
- survival.R
survival.R -h
Usage: survival.R -option=value <input.tsv
Option:
-p|pdf PDF The KM figure[KM.pdf]
-w|width INT The figure width
height INT The figure height
header With header
-m|main STR The main title
-x|xlab STR The xlab[Time]
-y|ylab STR The ylab[Survival Probability]
-h Show help
Input (header isn't necessary):
Example1:
#time event
1 TRUE
2 TRUE
2 TRUE
8 FALSE
5 TRUE
10 FALSE
Example2:
#time event group
1 TRUE male
2 TRUE male
2 TRUE male
8 FALSE male
5 TRUE female
10 FALSE female
As the help information shows, create an input file like Example 2 above (survival.tsv). Pass it to the script as input and specify the output pdf file:
survival.R -p=survival.pdf <survival.tsv >survival.log 2>survival.err
The survival curve (i.e. Kaplan-Meier curve) and the log-rank p-value are stored in the pdf. Statistical measurements are stored in survival.log.
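The figure title and axis labels can be customized with the options listed in the help, for example:
survival.R -p=survival.pdf -m='Survival by gender' -x=Month <survival.tsv >survival.log 2>survival.err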
- bar.R
bar.R -h
Usage: bar.R -p=outputName.pdf <input.tsv
Option:
Common:
-p|pdf FILE The output figure in pdf[figure.pdf]
-w|width INT The figure width
-height INT The figure height
-m|main STR The main title
Contents omitted…
-annoTxt STRs The comma-separated texts to be annotated
-annoTxtX INTs The comma-separated X positions of text
-annoTxtY INTs The comma-separated Y positions of text
-annoTxtS DOU The annotated text size[5]
Skill:
Legend title of alpha, color, etc can be set as the same to merge their guides
This script is used to draw a bar chart. Take the survival.tsv file from the previous section as input:
awk 'BEGIN{OFS="\t"}{print NR,$1,$3}' survival.tsv | bar.R -p=bar.pdf -fillV=V3 -x='Patient ID' -y=Time -fillT=Gender
In the command, awk is used to process the input before passing the data to bar.R. The processed results are:
1 | 1 | male |
2 | 2 | male |
3 | 4 | male |
4 | 8 | male |
5 | 5 | female |
6 | 10 | female |
The extra patient ID added as the first column will be presented on the X axis. The second column is the survival time and will be presented on the Y axis. The third column is gender and will be used to fill the bars with different colors. If the third column isn't present, all bars are filled in black by default.
Among the parameters of bar.R, -fillV=V3 tells it to use the third column (column name: V3) to color the bars; don't specify this option if there is no third column. -x and -y specify the labels of the X and Y axes, respectively. -fillT specifies the title of the fill legend.
- hist.R
hist.R -h
Usage: hist.R -p=outputName.pdf <input.tsv
Option:
Common:
-p|pdf FILE The output figure in pdf[figure.pdf]
-w|width INT The figure width
-m|main STR The main title
-mainS DOU The size of main title[22 for ggplot]
-x|xlab STR The xlab[Binned Values]
-y|ylab STR The ylab
Contents omitted…
-annoTxt STRs The comma-separated texts to be annotated
-annoTxtX INTs The comma-separated X positions of text
-annoTxtY INTs The comma-separated Y positions of text
Skill:
Legend title of alpha, color, etc can be set as the same to merge their guides
This script is used to draw a histogram.
hist.R -p=hist.pdf -x=Time -y='Patient Count' <survival.tsv
Only one column of values is needed to draw a histogram, so pass survival.tsv as input and the first column will be used by default to draw the histogram.
- coverageBedParser.pl
coverageBedParser.pl -h
Usage: perl coverageBedParser.pl coverageBedOutput.tsv >OUTPUT.tsv
If coverageBedOutput.tsv isn't specified, input from STDIN
Option:
-b -bedFormat INT Bed format ([3], 6)
-h --help Print this help information
This script is used to parse the output of 'bedtools coverage' and calculate the mean depth and covered fraction of each region.
coverageBedParser.pl coverage.bed >coverageParsed.bed
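A hedged sketch of the whole procedure; the exact 'bedtools coverage' output mode the parser expects (default, -d or -hist) isn't stated in the help, so check the script before relying on this:
bedtools coverage -a regions.bed -b final.bam >coverage.bed
coverageBedParser.pl -b 6 coverage.bed >coverageParsed.tsv   # -b 6 if regions.bed is in bed6 format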
- fastqcParser.pl
fastqcParser.pl -h
Usage: perl fastqcParser.pl OPTION fastqc_data.txt >OUTPUT.tsv
If fastqc_data.txt isn't specified, input from STDIN
Options:
-h --help Print this help information
This script is used to parse the result of FastQC (in general, the fastqc_data.txt file), extract the read count, GC content, mean quality, etc., and output them as a table.
fastqcParser.pl fastqc_data.txt >fastqcDataParsed.tsv
- geneRPKM.pl
geneRPKM.pl -h
Usage: perl geneRPKM.pl -g gene_structure.gpe -s 4 INPUT.BAM >RPKM.bed6+ 2>running.log
If INPUT.BAM isn't specified, input is from STDIN
Output to STDOUT in bed6 (gene in name column, RPKM in score column) plus longest transcript, readNO and transcript length
This script chooses the LONGEST transcript of each gene as reference transcript to measure RPKM
Option:
-g --gpe FILE A gpe file with comment or track line allowed
-b --bin With bin column
-l --libType STR The library type, it can be
fr-unstranded: for Standard Illumina (default)
fr-firststrand: for dUTP, NSR, NNSR
fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol
-s --slop INT Specify the slopping length from the exon-intron joint to intron[0]
-u --uniq Only use uniquely-mapped reads (NH=1)to compute RPKM
--log FILE Record running log into FILE
-h --help Print this help information
As the help information describes, this script chooses the longest transcript of each gene as the reference to quantify each gene's expression in RPKM. The output is in bed6+ format. A running example:
geneRPKM.pl -g refGene.gpe -b -s 4 final.bam >RPKM.bed6+ 2>geneRPKM.log
In the command, the -s option specifies how far a read may extend from an exon into the adjacent intron. This option handles reads that overhang an exon-intron junction by a few base pairs instead of being spliced across the intron to the adjacent exon.
This script may consume a lot of memory (depending on the size of the input BAM file). If the memory on the machine isn't sufficient, the script geneRPKM_mem.pl, which consumes little memory but runs more slowly, is an alternative choice.
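To restrict the calculation to uniquely-mapped reads and write the log to a file via the --log option instead of redirecting STDERR:
geneRPKM.pl -g refGene.gpe -b -s 4 -u --log geneRPKM.log final.bam >RPKM.uniq.bed6+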
- regionRPKM.pl
regionRPKM.pl -h
Usage: perl regionRPKM.pl -b region.bed INPUT.bam >RPKM.bed 2>running.log
If INPUT.bam isn't specified, input from STDIN
Output to STDOUT with bed columns plus reads count in region and its RPKM
Note: INPUT.bam should be indexed with samtools index
This script is for handling bam file in normal size that can be entirely cached into memory.
It's MEMORY-CONSUMED but low TIME-CONSUMED compared to its equivalent regionRPKM_mem.pl.
Spliced reads are handled now. Those that include the whole region within intron aren't counted.
Option:
-b|bedFile FILE Region file in bed4 or bed6 format. bed plus is allowed.
-l|libType STR The library type, it can be
fr-unstranded: for Standard Illumina (default)
fr-firststrand: for dUTP, NSR, NNSR
fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol
-h --help Print this help information
This script calculates the RPKM of specific regions (specified by the --bedFile option). A running example:
regionRPKM.pl --bedFile exon.bed final.bam >RPKM.bed 2>regionRPKM.log
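Remember that the BAM file must be indexed first; for a stranded library (e.g. dUTP), the library type can be specified as in this sketch:
samtools index final.bam
regionRPKM.pl --bedFile exon.bed --libType fr-firststrand final.bam >RPKM.stranded.bed 2>regionRPKM.log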
Similarly, if the memory isn't sufficient, use regionRPKM_mem.pl as an alternative.
Only identification and quantification of SE (skipped exon) events are implemented in the current version.
- psiSE.pl
psiSE.pl -h
Usage: perl psiSE.pl INPUT.bam >OUTPUT.bed6+
If INPUT.bam isn't specified, input from STDIN
Option:
-b --bed FILE Gene models in bed12 format
-l --libraryType STR The library type, it can be
fr-unstranded: for Standard Illumina (default)
fr-firststrand: for dUTP, NSR, NNSR
fr-secondstrand: for Ligation, Standard SOLiD and Illumina Directional Protocol
-s --slop INT Maximal slope length for a read to be considered as exonic read[4]
-r --minRead INT Minimal supporting reads count for an exclusion junction[2]
-h --help Print this help information
Output:
The 4th column is the transcript name and the exon rank (in transcriptional direction) separated by a dot.
The 5th column in OUTPUT.bed6+ is the PSI normalized into 0-1000.
Additional columns are as follow:
inclusion read count
inclusion region length
inclusion read density
exclusion read counts separated by comma
exclusion region lengths separated by comma
exclusion read density
This script is used to identify SE events from the alignments and quantify their PSI. A usage example:
psiSE.pl final.bam >SE.bed6+
Refer to the help information for the format of the output result.
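Gene models and stricter thresholds can be supplied through the options in the help; for example, assuming refGene.bed12 is a bed12 gene model file (it can be produced with gpe2bed.pl):
psiSE.pl -b refGene.bed12 -r 5 final.bam >SE.bed6+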
You can send the author (Sky) any feedback about this toolkit, such as bug reports, typos, and suggestions for performance improvement.