Release 3: Maria Bogaerts [email protected], Josefa Gonzalez [email protected]
T-lex is a computational pipeline that detects presence and/or absence of annotated individual transposable elements (TEs) using next-generation sequencing (NGS) data.
Version 3:
- In the PRESENCE module, MAQ is changed by BWA-MEM
- Filtering steps are added to increase the accuracy of genotyping
- New parameters that allow the user to control the minimum and maximum number of reads, and the minimum number of strains to genotype a TE insertion
- Fixed several bugs present in the previous version
Previous versions:
T-lex Fiston-Lavier et al., 2015 |
Paper |
Repository |
- Unix system with Perl version 5.10.0 or higher | Perl |
- RepeatMasker and libraries open-3.2.8 or higher (Smit 1996-2010) | RepeatMasker | Libraries |
- SHRIMP2 - Short Read Mapping Package- Release 2.2.1/Oct. 31, 2011 (David et al 2011) | SHRIMP2 |
- BLAT version 34 or greater (Kent 2002) | BLAT |
- Samtools version 1.4 or higher (Li et al 2009)
- Bcftools version 1.3.1 or higher (Li et al 2009)
- BWA version 0.7.15 or higher (Li and Durbin 2009)
- FastX toolkit 0.0.14 or higher | FastX |
Only for the Target Site Duplication (TSD) detection:
- Phrap version 1.090518 ("Phil's Revised Assembly Program" (Green, 1999) | Phrap |
- For downloading this program, you may need to contact the original authors to get permission
- FastaGrep developed by the Department of Bioinformatics at the University of Tartu & Estonian Biocentre | FastaGrep |
T-lex requires at least four input data:
File | Description |
---|---|
TE list | the list of the TEs to be analyzed with one TE identifier per line (e.g for a TE in Drosophila 'FBti0019293') |
TE annotations | the annotations of these TEs. Tabulated file with five columns: TE name / location / start nucleotide position/ end nucleotide position / strand (+ or -), e.g. 'FBti0019293 3R 405387 406627 - ' |
reference genome | the reference genomic sequences where TEs have been annotated in fasta format (e.g. '>3R XXXXX') |
NGS data | the NGS data in official fastq format |
NGS data has to be stored in one directory in which each subdirectory corresponds to a single strain NGS data such as:
[input strain directory]/
[strain name 1]/
[strain name 1].fastq
[strain name 2]/
[strain name 2].fastq
To handle paired-end reads, separate the reads from a same pair ([strain name]) in two fastq files ([strain name]_1.fastq and [strain name]_2.fastq). Each fastq file name should be r enamed as 'XXX_X.fastq'.
[input strain directory]/
[strain name]/
[strain name]_1.fastq
[strain name]_2.fastq
T-lex3 is composed by five different modules. All the modules can be run in the same job, excepting calculating frequencies for individual strains and TSD detection (see OPTIONS). Each module can be also run independently in case only one part of the analysis is required.
General usage:
tlex-open-v3.pl [ Options ] [ -T TE list ] [ -M TE annotations ] [ -G reference genome ] [ -R NGS data ]
First time running T-lex3 (for different options, see "OPTIONS/PARAMETERS"):
-
General run for one strain: TE-filtering, presence, absence, combine
perl tlex-open-v3.pl \ -O projectname \ -A 95 \ -pairends yes \ -s 'drosophila' \ -T absolute_path/TElist_file.txt \ -M absolute_path/TEannotation_file.txt \ -G absolute_path/genome_file.fa \ -R absolute_path/fastq_files/strain
Output: path_to_folder/tlex_projectname/Tresults
-
Frequency estimation in several individual strains (NOTE: projectname is required if different strains are run, otherwise the tlex_output folder will be overwritten by the following strain)
perl tlex-open-v3.pl -combData
Output: path_to_folder/Tfreqs_output/Tresults
perl tlex-open-v3.pl -freq
Output: path_to_folder/Tfreqs_output/Tfreq
-
General run in a pool strain including frequency estimation
perl tlex-open-v3.pl \ -O projectname \ -A 95 \ -pairends yes \ -s 'drosophila' \ -pooled \ -T absolute_path/TElist_file.txt \ -M absolute_path/TEannotation_file.txt \ -G absolute_path/genome_file.fa \ -R absolute_path/fastq_files/strain
Output: path_to_folder/tlex_projectname/Tresults Output: path_to_folder/tlex_projectname/Tfreq
-
TSD analysis in any strain (requires ABSENT module alredy run)
perl tlex-open-v3.pl \ -tsd \ -align \ -p \ -O projectname \ -A 95 \ -pairends yes \ -s 'drosophila' \ -T absolute_path/TElist_file.txt \ -M absolute_path/TEannotation_file.txt \ -G absolute_path/genome_file.fa \ -R absolute_path/fastq_files/strain
Output: path_to_folder/Tannot_TSDdetection
If only one of the Presence or Absence modules is run, results are obtained in the following paths:
path_to_folder/tlex_projectname/Tresults_present
path_to_folder/tlex_projectname/Tresults_absent
After the first time run we recommend the user:
- T-lex generates a list with filtered TEs, TElist_file.txt_filtered, which is located in the same folder as the original input TE list. We recommend to use this new file as input list parameter, and bypass the TE-analysis with the command -noFilterTE (see options).
- T-lex generates a new file according with the TSD detection. We recommend changing the annotation input file according with this information and use this new file as input annotation parameter. For following runs, the TSD module could be bypassed.
Parameters | Type | Description |
---|---|---|
General parameters: | ||
-A | int | maximum read length in the data set in bp ( default: 100 ) |
-O | string | project name (default: tlex_output) |
-R | string | NGS data in FASTQ path |
-G | string | Absolute path to reference genome |
-T | string | Absolute path to TE list |
-M | string | Absolute path to TE annotation file |
-noclean | keep the intermediate files | |
-binreads | bypasses the generation of data in the absence and presence module | |
-h or -help | display this help | |
For the TE filtering step: | ||
-s | string | name of the species studied ( for RepeatMasker, e.g. 'drosophila' ) |
-d | float | minimum repeat density at the flanking regions of the TEs ( default: 0.5 (50%); the default parameter is set according to Drosophila melanogaster genome, we recommend changing it deppending on the specie (i.e. for humans we recommend 1 (100%) ) |
-noFilterTE | do not filter TEs | |
Module specific parameters: | ||
PRESENCE | ||
-q | launch only the presence detection approach | |
-j | int | length of the junction sequences to extract in bp ( default: 1000 ) |
-b | int | length of the internal region of the TE in bp ( default: 60 ) |
-limp | int | minimum match length required with the TE sequence in bp ( default: 15 ) |
-id | int | minimum sequence identity required with the TE sequence in % ( default: 95 ) |
-pairends | string | PE mapping at 'no' by default ('yes' or 'no') |
-minQ | int | minimum quality Phrep score required to select a read ( default: 30 ) |
-processes | int | number of processes (used only for the NGS data reformatting; default: 1 ) |
ABSENCE | ||
-p | launch only the absence detection approach | |
-f | int | length of the flanking sequences to extract and concatenate in bp ( default: 125 ) |
-v | int | minimum read length spanning the two TE sides in bp ( default: 20 ) |
-pairends | string | PE mapping at 'no' by default ('yes' or 'no') |
-lima | int | minimum non-repeated region on each side of the sequence in bp ( default: 10 ) |
-binref | bypasses the generation of the new reference genome without insertions | |
COMBINE | ||
-combRes | combine the presence/absence results from one strain ( REQUIRED: PRESENCE and ABSENCE modules ) (this parameter only needs to be specified if you have run the PRESENCE and ABSENCE modules independently and you want to combine those results) | |
-combData | combine the presence/absence results from different strains ( REQUIRED: at least two individual strains Tresults data file )(this parameter can only be run once there are more than 1 individual strains already run with the normal pipeline) | |
-combAll | combine the frequency estimates with the TE breakpoint analysis ( for pooled data only ) | |
TSD (requires the ABSENCE module '-p' ) | ||
-align | return the multiple alignments | |
-tsd | return the TSDs for the TE insertions detected as absent | |
POOL FREQUENCIES | ||
-freq | return the TE frequency based on the given strains | |
-pooled | return the TE frequency based on pooled data ( to use with the opti | |
-minR | int | minimum number of reads to calculate frequency ( default: 3 ) |
-maxR | int | maximum number of reads to calculate frequency ( default: 90 ) |
INDIVIDUAL FREQUENCIES (after using combData for combining all individual strain results) | ||
-freq | return the TE frequency based on the given strains | |
-minP | minimum number of TE calls to estimate the TE frequency ( default: 1 ) |
A small dataset example is provided. This dataset correspond only to a few TEs in an individual Drosophila melanogaster strain. In addition, different results for the combination of individual strains are available. Run the following command lines in the github folder.
Get the absolute path:
absolute_path=$(pwd)
Decompress genome example:
cd example
tar -xvzf genome_example.fa.tar.gz
cd ..
Running T-lex:
perl tlex-open-v3.pl \
-O example \
-A 95 \
-pairends yes \
-T $absolute_path/example/TElist_example.txt \
-M $absolute_path/example/TEannotation_example.txt \
-G $absolute_path/example/genome_example.fa \
-R $absolute_path/example/fastq_files/example
For TSD detection:
perl tlex-open-v3.pl \
-tsd \
-align \
-p \
-O example \
-A 95 \
-pairends yes \
-T $absolute_path/example/TElist_example.txt \
-M $absolute_path/example/TEannotation_example.txt \
-G $absolute_path/example/genome_example.fa \
-R $absolute_path/example/fastq_files/example
Combination of different strains (run it in folder example/tlex_results/):
perl ../../tlex-open-v3.pl \
-combData
T-lex produces several of output directories and files. The output is stored in a working directory named by default: 'tlex_output' or 'tlex_[project name]'. By default, only the final results (the 'Tresults' file) and the data necessary for the manual curation (the 'Tanalysis' and 'Talign' sub-directories) are returned.
-
TE flanking sequence analysis:
The 'Tanalysis' sub-directory includes two RepeatMasker output files: One contains the submitted sequences in which the repeats have been masked (.masked). The repeated regions are represented by 'N's'. The other file contains the table summarizing the detected repeats (.out). Tanalysis also includes the detection of poly-A tails by looking for strech of 'A' or 'T' in three prime of the TE flanking sequence. The files are organized such as:Tanalysis/ Tflank_checking_[length of the flanking region analyzed].fasta.masked Tflank_checking_[length of the flanking region analyzed].fasta.out Tpoly_[length of the flanking region analyzed].fasta.polyAT
-
TE detection results
The 'Tresults' file contains the TE presence/absence detection results. Information about the read and repeat density at the flanking regions of each TE are also returned. This file is composed of 20 columns described below:Column Field Description 1 strain name of the strain 2 TE name of the TE 3 TE absence detection TE detection result from the absence detection approach 4 TE presence detection TE detection result from the presence detection approach 5 TE detection conclusion final result combining the results from the two approaches 6 absence read number number of reads spanning the two TE flanking sequences 7 left match length length of the terminal match overlapping the TE sequence of the left contig. This value can be negative if the full TE sequence is missing 8 left match id equence identity of the terminal match overlapping the TE sequence of the left contig. The number of mismatches/matches is reported in parenthesis 9 poly(A) detection on left TE side poly(A) detection result after analysis of the left TE side. If no poly(A) or poly(T) detected = 'no_polyAT' 10 left_coverage mean read coverage at the left flanking side of the TE 11 left_repeat name of the repeat located at the left side of the TE. If no repeat is detected = 'no repeat' 12 presence read number left number of reads spanning the TE left junction flanking sequences 13 presence read number left filtered number of reads spanning the TE left junction flanking sequences taking into account for pool frequency estimation 14 right match length length of the terminal match overlapping the TE sequence of the right contig. This value can be negative if the full TE sequence is missing 15 right match id sequence identity of the terminal match overlapping the TE sequence of the right contig. The number of mismatches/matches is reported in parenthesis 16 poly(A) detection on right TE side poly(A) detection result after analysis of the right TE side. If no poly(A) or poly(T) detected = 'no_polyAT' 17 right_coverage mean read coverage at the right flanking side of the TE 18 right_repeat name of the repeat located at the right side of the TE. If no repeat is detected = 'no repeat' 19 presence read number right number of reads spanning the TE right junction flanking sequences 20 presence read number right filtered number of reads spanning the TE right junction flanking sequences taking into account for pool frequency estimation -
TE frequency estimates Using the combination of the results, T-lex estimates the frequency for each TE in the population dividing the number of strains for which the TE is present by the total number of strains for which T-lex returns data. The 'Tfreq' file will be located in a folder called "Tfreqs_output" and contains 6 columns:
Column Field Description 1 TE TE name 2 presence results number of strains for which T-lex classified the TE as 'present' 3 polymorphic results number of strains for which T-lex classified the TE as 'polymorphic' 4 absence results number of strains for which T-lex classified the TE as 'absent' 5 total results number of strains for which T-lex returns a result 6 TE frequency TE frequency estimate (i.e. adding the number of strains for which the TE is 'present' and one-half times the number of 'polymorphic' strains, and dividing by the total number of strains for which T-lex returns results)
Using the option '-pooled', T-lex can also estimate the frequency for each TE in the population using pooled NGS data Fiston-Lavier et al. 2011. This estimate is based on the number of reads supporting the absence and the presence of the TE insertions reported by T-lex. The frequency estimates are stored in the file called 'Tfreq_pooled'. This file contains five columns such as:
Column | Field | Description |
---|---|---|
1 | TE | Name of the TE |
2 | absence read number | Number of reads providing evidence of absence |
3 | presence read number/left | Number of reads providing evidence of presence on the left TE side |
4 | presence read number/right | Number of reads providing evidence of presence on the right TE side |
5 | TE frequency | TE frequency estimate(i.e. Sum of the frequency estimates using each TE detection approach (see Fiston-Lavier et al. 2011) |
To be completed by the users.
1. T-lex3 has stopped before finishing.
Try to increase the RAM used for the program. With some species, T-lex requires more RAM to execute the program.
2. Absence module is working but Presence module is not.
Check if the fastq input files are properly placed in the correct folders.
- RepeatMasker. version Open 3.2.8 A.F.A. Smit, R. Hubley & P. Green RepeatMasker here.
- Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O., Walichiewicz,J. (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110:462_467
- Li,H., Ruan,J., Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18:1851_1858.
- Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A., Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput. Biol. 5(5): e1000386. doi:10.1371/journal.pcbi.1000386.
- David M, Dzamba M, Lister D, Ilie L, Brudno M. (2011) SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics. Apr 1;27(7):1011-2. Epub 2011 Jan 28.
- Kent W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res. 12:656-664
- de la Bastide M, McCombie WR. (2007) Assembling genomic DNA sequences with PHRAP. Curr Protoc Bioinformatics. 2007 Mar;Chapter 11:Unit11.4.
- Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
- Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
- Fiston-Lavier AS, Carrigan M, Petrov DA and Gonzalez J. T-LEX: A program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nuc. Acids. Res. 2011 Mar 1;39(6):e36. Epub 2010 Dec 21
- Fiston-Lavier AS, Barron MG, Petrov DA, Gonzalez J. T-lex2: genotyping, frequency estimation and re-annotation of transposable elements using single or pooled next-generation sequencing data. Nucleic Acids Res. 2015;43(4):e22. pmid:25510498