-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.txt
141 lines (98 loc) · 6.44 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
## FASTXTEND
Fastxtend is an extension of FASTX-Toolkit package (http://hannonlab.cshl.edu/fastx_toolkit/index.html).
Fastxtend is based on the library developed in fastx toolkit, it is written is C and it extend the FASTX-Toolkit with four commands line tools for Short-Reads FASTA/FASTQ files preprocessing:
- fastx_clean allows cleaning (adapters, N, quality) of the reads in fastq files.
- fastx_duplicatedReads estimates the duplicates rate of reads (single or pair reads). It computes a rapid and accurate estimation of the duplicates rate of an initial read set using a sample of this read set.
- fastx_mergepairs perform the merging of the paired reads and give some statistics (merged size, percent of pairs merged).
- fastx_stats
Fastxtend is distributed open-source under CeCILL
FREE SOFTWARE LICENSE. Check out http://www.cecill.info/
for more information about the contents of this license.
Fastxtend home on the web is http://www.genoscope.cns.fr/fastxtend/
## COMMAND LINE ARGUMENTS
- All of the tools show usage information with -h
- The option -Q is not documented in the usages it corresponds to the ASCII offset (generally -Q 33) and performs with all commands.
#### fastx_clean
$ fastx_clean -h
usage: fastx_clean [-h] [-a ADAPTER_FILE] [-D] [-l N] [-n N] [-M N] [-m N] [-p N] [-c] [-C] [-o] [-v] [-z] [-i INFILE] [-o OUTFILE]
Developped at Genoscope using the FASTX Toolkit 0.0.13.1
[-h] = This helpful help screen.
[-a ADAPTER_FILE] = ADAPTER file in fasta format.
[-j] = Keep the longest sequence before adaptater.
[-l N] = Discard sequences shorter than N nucleotides. default is 10.
[-q N] = Quality threshold - nucleotides with lower quality will be trimmed (from both ends of the sequence).
default value is 2, use 0 to inactivate this trimming.
[-n N] = Trim sequences after N unknown nucleotides. default is 0 = off. This cleaning is done on the trimmed (adapter + quality) sequence.
[-c] = Discard non-clipped sequences (i.e. - keep only sequences which contained the adapter).
[-C] = Discard clipped sequences (i.e. - keep only sequences which did not contained the adapter).
[-k] = Report Adapter-Only sequences.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
[-z] = Compress output with GZIP.
[-D] = DEBUG output.
[-M N] = Require minimum adapter alignment length of N.
If less than N nucleotides aligned with the adapter - don't clip it.
[-m N] = Maximum number of mismatches allowed, default is 3
[-p N] = Allow one mismatch every N bases, default is 5 so one mismatch allow every 5 nucleotides
[-r] = Reverse complement the input fastx file and clip.
[-f] = Clip the input fastx file in forward, default is true.
[-e] = Recursive alignment : iterate alignments between sequence and adapters until a match is found.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-s STAT_FILE] = Tabular output file which contains trimming details for each input sequence.
#### fastx_duplicatedReads
$ fastx_duplicatedReads -h
usage: fastx_duplicatedReads [-h] [-v] [-i INFILE] [-o OUTFILE]
Developped at Genoscope using the FASTX Toolkit 0.0.13.1
[-h] = This helpful help screen.
[-s SAMPLE] = FASTA/Q input file of a sample extract from INFILE
[-i INFILE] = FASTA/Q input file. Default is STDIN.
[-t SAMPLE2] = FASTA/Q input file of a sample extract from INFILE2 (Optional)
[-j INFILE2] = FASTA/Q input file for Read 2 (Optional)
[-c INT] = Trim sides of reads by a specified percentage (default: 0%)
#### fastx_mergepairs
$ fastx_mergepairs -h
usage: fastx_mergepairs [-h] [-l N] [-m N] [-i N] [-s] [-M] [-a INFILE1] [-b INFILE2] [-o OUTFILE]
Developped at Genoscope using the FASTX Toolkit 0.0.13.1
[-h] = This helpful help screen.
[-l N] = Fragment size of read 2 used for detecting an overlap, default is 40
[-m N] = Maximal number of mismatches of the alignment of read 1 and subpart of read2, default is 4
[-i N] = Minimal identity percent of the alignment of read 1 and subpart of read2, default is 90
[-L N] = Minimal size of the alignment, default is 15
[-s] = Silent mode
[-a INFILE1] = FASTA/Q input file
[-b INFILE2] = FASTA/Q input file
[-o OUTFILE] = FASTA/Q output file
[-u OUTFILE1] = FASTA/Q output file of unpaired reads (e.g. merged reads).
[-p OUTFILE1] = FASTA/Q output file of paired reads (e.g. non-merged reads).
[-q OUTFILE1a] = FASTA/Q output file of read1 from paired reads (e.g. non-merged reads).
[-r OUTFILE1b] = FASTA/Q output file of read2 from paired reads (e.g. non-merged reads).
Choose between [-o] and [-u],[-p] and [-u],[-q],[-r] arguments.
[-x FILE] = Print distribution of overlap size between read1 and read2
[-M] = Only print merged pairs, default is no.
#### fastx_stats
$ fastx_stats -h
usage: fastx_stats [[-h] [-f INFILE] [-q]]
[-h] = This helpful help screen.
[-f INFILE] = FASTA/Q input file. default is STDIN.
[-q] = Display quality values distribution.
## PRE-REQUISITES
- A Linux based operating system.
- Binaries are provided for the following platform : Linux x86_64
- g++ with gcc 4.1.2 or higher
## INSTALLATION
1. Clone this GitHub repository
2. Compile sources
`make;`
3. Install binaries
`make install`
This will install the tools into ./bin
## More informations
If you have questions about Fastxtend, you may ask them to sengelen [at] genoscope [.] cns [.] fr and jmaury [at] genoscope [.] cns [.] fr . You may also create an issue to ask questions on github website: https://github.com/institut-de-genomique/fastxtend/issues.
## ACKNOWLEDGMENTS
Stefan Engelen, Cyril Firmo and Jean-Marc Aury - Fastxtend's authors
This work was financially supported by the Genoscope,
Institut de Genomique, CEA and Agence Nationale de la
Recherche (ANR), and France Génomique (ANR-10-INBS-09-08).