Merge branch 'viash-hub:main' into main

emmarousseau · Sep 18, 2024 · 7d8e588
2 parents 3af66f8 + 619f1bb
commit 7d8e588
Showing 109 changed files with 12,175 additions and 177 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,20 @@
 # biobox x.x.x
 
+## NEW FUNCTIONALITY
+
+* `agat`:
+  - `agat/agat_convert_genscan2gff`: Convert a GENSCAN file into a GFF file (PR #100).
+
+* `bd_rhapsody/bd_rhapsody_sequence_analysis`: BD Rhapsody Sequence Analysis CWL pipeline (PR #96).
+
+* `rsem/rsem_calculate_expression`: Calculate expression levels (PR #93).
+
+## MINOR CHANGES
+
+* Upgrade to Viash 0.9.0.
+
+# biobox 0.2.0
+
 ## BREAKING CHANGES
 
 * `star/star_align_reads`: Change all arguments from `--camelCase` to `--snake_case` (PR #62).
@@ -29,13 +44,27 @@
 * `bedtools`:
   - `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
   - `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).
+  - `bedtools/bedtools_genomecov`: Compute the coverage of a feature file (bed/gff/vcf/bam) among a genome (PR #128).
+  - `bedtools/bedtools_groupby`: Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command (PR #123).
+  - `bedtools/bedtools_merge`: Merges overlapping BED/GFF/VCF entries into a single interval (PR #118).
   - `bedtools/bedtools_bamtofastq`: Convert BAM alignments to FASTQ files (PR #101).
   - `bedtools/bedtools_bedtobam`: Converts genomic feature records (bed/gff/vcf) to BAM format (PR #111).
+  - `bedtools/bedtools_bed12tobed6`: Converts BED12 files to BED6 files (PR #140).
+  - `bedtools/bedtools_links`: Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a (bed/gff/vcf) file (PR #137).
 
 * `qualimap/qualimap_rnaseq`: RNA-seq QC analysis using qualimap (PR #74). 
 
 * `rsem/rsem_prepare_reference`: Prepare transcript references for RSEM (PR #89).
 
+* `bcftools`:
+  - `bcftools/bcftools_concat`: Concatenate or combine VCF/BCF files (PR #145).
+  - `bcftools/bcftools_norm`: Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows (PR #144).
+  - `bcftools/bcftools_annotate`: Add or remove annotations from a VCF/BCF file (PR #143).
+  - `bcftools/bcftools_stats`: Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats (PR #142).
+  - `bcftools/bcftools_sort`: Sorts BCF/VCF files by position and other criteria (PR #141).
+
+* `fastqc`: High throughput sequence quality control analysis tool (PR #92).
+
 ## MINOR CHANGES
 
 * `busco` components: update BUSCO to `5.7.1` (PR #72).
@@ -123,14 +152,24 @@
     - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #53).
 
 * `umi_tools`:
-    -`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
+    - `umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71).
+    - `umi_tools/umi_tools_prepareforrsem`: Fix paired-end reads in name sorted BAM file to prepare for RSEM (PR #148).
 
 * `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43).
 
 * `bedtools`:
     - `bedtools_getfasta`: extract sequences from a FASTA file for each of the
                            intervals defined in a BED/GFF/VCF file (PR #59).
 
+* `sortmerna`: Local sequence alignment tool for mapping, clustering, and filtering rRNA from metatranscriptomic 
+               data (PR #146).
+
+* `fq_subsample`: Sample a subset of records from single or paired FASTQ files (PR #147).
+
+* `kallisto`:
+    - `kallisto_index`: Create a kallisto index (PR #149).
+
+
 ## MINOR CHANGES
 
 * Uniformize component metadata (PR #23).

diff --git a/_viash.yaml b/_viash.yaml
@@ -7,7 +7,7 @@ links:
   issue_tracker: https://github.com/viash-hub/biobox/issues
   repository: https://github.com/viash-hub/biobox
 
-viash_version: 0.9.0-RC7
+viash_version: 0.9.0
 
 config_mods: |
   .requirements.commands := ['ps']
diff --git a/src/agat/agat_convert_genscan2gff/config.vsh.yaml b/src/agat/agat_convert_genscan2gff/config.vsh.yaml
@@ -0,0 +1,95 @@
+name: agat_convert_genscan2gff
+namespace: agat
+description: |
+  The script takes a GENSCAN file as input and translates it into GFF
+  format. The GENSCAN format is described [here](http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/genscan.html).
+  
+  **Known problem** 
+
+  You must submit only the DNA sequence to GENSCAN, without any header. The tool expects only DNA
+  sequences and does not crash or warn if a header is submitted along with the
+  sequence. For example, if you have a header ">seq", the letters s-e-q are read as the first three
+  nucleotides of the sequence, and all prediction locations are shifted
+  accordingly. (Checked only on the [online version](http://argonaute.mit.edu/GENSCAN.html); 
+  it is unknown whether the same problem occurs elsewhere.)
+keywords: [gene annotations, GFF conversion, GENSCAN]
+links:
+  homepage: https://github.com/NBISweden/AGAT
+  documentation: https://agat.readthedocs.io/en/latest/tools/agat_convert_genscan2gff.html
+  issue_tracker: https://github.com/NBISweden/AGAT/issues
+  repository: https://github.com/NBISweden/AGAT
+references: 
+  doi: 10.5281/zenodo.3552717
+license: GPL-3.0
+requirements:
+  - commands: [agat]
+authors:
+  - __merge__: /src/_authors/leila_paquay.yaml
+    roles: [ author, maintainer ]
+
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: --genscan
+        alternatives: [-g]
+        description: Input genscan bed file that will be converted.
+        type: file
+        required: true
+        direction: input
+  - name: Outputs
+    arguments:       
+      - name: --output
+        alternatives: [-o, --out, --outfile, --gff]
+        description: Output GFF file. If no output file is specified, the output will be written to STDOUT.
+        type: file
+        direction: output
+        required: true
+        example: output.gff
+  - name: Arguments
+    arguments:
+      - name: --source
+        description: |
+          The source informs about the tool used to produce the data and is stored in 2nd field of a gff file. Example: Stringtie, Maker, Augustus, etc. [default: data]
+        type: string
+        required: false
+        example: Stringtie
+      - name: --primary_tag
+        description: |
+          The primary_tag corresponds to the data type and is stored in 3rd field of a gff file. Example: gene, mRNA, CDS, etc. [default: gene]
+        type: string
+        required: false
+        example: gene
+      - name: --inflate_type
+        description: |
+          Feature type (3rd column in gff) created when inflate parameter activated [default: exon].
+        type: string
+        required: false
+        example: exon
+      - name: --verbose
+        description: Add verbosity.
+        type: boolean_true
+      - name: --config
+        alternatives: [-c]
+        description: |
+          AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
+        type: file
+        required: false
+        example: custom_agat_config.yaml
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
+engines:
+  - type: docker
+    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
+    setup:
+      - type: docker
+        run: |
+          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
+runners:
+  - type: executable
+  - type: nextflow
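
Review note: the `setup` step in this config records the tool version by rewriting the output of `agat --version` with `sed`. The rewrite can be sketched in isolation as follows, assuming the command prints a line such as `AGAT v1.4.0`. The sample string and the `format_version` helper are illustrative assumptions, not taken from AGAT. The sketch uses the POSIX class `[[:space:]]` where the config uses the GNU extension `\s`.

```bash
# Hypothetical helper: rewrite a version banner into a YAML-style key/value line.
format_version() {
  # Capture everything after "AGAT" plus one whitespace character.
  sed 's/AGAT[[:space:]]\(.*\)/agat: "\1"/'
}

# Assumed sample input; the real output of `agat --version` may differ.
echo "AGAT v1.4.0" | format_version
```

If the banner does not match the pattern, `sed` passes the line through unchanged, so the version file would then hold the raw banner rather than a key/value pair.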
diff --git a/src/agat/agat_convert_genscan2gff/help.txt b/src/agat/agat_convert_genscan2gff/help.txt
@@ -0,0 +1,94 @@
+```sh
+agat_convert_genscan2gff.pl --help
+```
+ ------------------------------------------------------------------------------
+|   Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0                      |
+|   https://github.com/NBISweden/AGAT                                          |
+|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
+ ------------------------------------------------------------------------------
+
+Name:
+    agat_convert_genscan2gff.pl
+
+Description:
+    The script takes a genscan file as input, and will translate it in gff
+    format. The genscan format is described here:
+    http://genome.crg.es/courses/Bioinformatics2003_genefinding/results/gens
+    can.html /!\ vvv Known problem vvv /!\ You must have submited only DNA
+    sequence, wihtout any header!! Indeed the tool expects only DNA
+    sequences and does not crash/warn if an header is submited along the
+    sequence. e.g If you have an header ">seq" s-e-q are seen as the 3 first
+    nucleotides of the sequence. Then all prediction location are shifted
+    accordingly. (checked only on the online version
+    http://argonaute.mit.edu/GENSCAN.html. I don't know if there is the same
+    pronlem elsewhere.) /!\ ^^^ Known problem ^^^^ /!\
+
+Usage:
+        agat_convert_genscan2gff.pl --genscan infile.bed [ -o outfile ]
+        agat_convert_genscan2gff.pl -h
+
+Options:
+    --genscan or -g
+            Input genscan bed file that will be convert.
+
+    --source
+            The source informs about the tool used to produce the data and
+            is stored in 2nd field of a gff file. Example:
+            Stringtie,Maker,Augustus,etc. [default: data]
+
+    --primary_tag
+            The primary_tag corresponf to the data type and is stored in 3rd
+            field of a gff file. Example: gene,mRNA,CDS,etc. [default: gene]
+
+    --inflate_off
+            By default we inflate the block fields (blockCount, blockSizes,
+            blockStarts) to create subfeatures of the main feature
+            (primary_tag). Type of subfeature created based on the
+            inflate_type parameter. If you don't want this inflating
+            behaviour you can deactivate it by using the option
+            --inflate_off.
+
+    --inflate_type
+            Feature type (3rd column in gff) created when inflate parameter
+            activated [default: exon].
+
+    --verbose
+            add verbosity
+
+    -o , --output , --out , --outfile or --gff
+            Output GFF file. If no output file is specified, the output will
+            be written to STDOUT.
+
+    -c or --config
+            String - Input agat config file. By default AGAT takes as input
+            agat_config.yaml file from the working directory if any,
+            otherwise it takes the orignal agat_config.yaml shipped with
+            AGAT. To get the agat_config.yaml locally type: "agat config
+            --expose". The --config option gives you the possibility to use
+            your own AGAT config file (located elsewhere or named
+            differently).
+
+    -h or --help
+            Display this helpful text.
+
+Feedback:
+  Did you find a bug?:
+    Do not hesitate to report bugs to help us keep track of the bugs and
+    their resolution. Please use the GitHub issue tracking system available
+    at this address:
+
+                https://github.com/NBISweden/AGAT/issues
+
+     Ensure that the bug was not already reported by searching under Issues.
+     If you're unable to find an (open) issue addressing the problem, open a new one.
+     Try as much as possible to include in the issue when relevant:
+     - a clear description,
+     - as much relevant information as possible,
+     - the command used,
+     - a data sample,
+     - an explanation of the expected behaviour that is not occurring.
+
+  Do you want to contribute?:
+    You are very welcome, visit this address for the Contributing
+    guidelines:
+    https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
diff --git a/src/agat/agat_convert_genscan2gff/script.sh b/src/agat/agat_convert_genscan2gff/script.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+set -eo pipefail
+
+## VIASH START
+## VIASH END
+
+# unset flags
+[[ "$par_inflate_off" == "false" ]] && unset par_inflate_off
+[[ "$par_verbose" == "false" ]] && unset par_verbose
+
+# run agat_convert_genscan2gff
+agat_convert_genscan2gff.pl \
+  --genscan "$par_genscan" \
+  --output "$par_output" \
+  ${par_source:+--source "${par_source}"} \
+  ${par_primary_tag:+--primary_tag "${par_primary_tag}"} \
+  ${par_inflate_off:+--inflate_off} \
+  ${par_inflate_type:+--inflate_type "${par_inflate_type}"} \
+  ${par_verbose:+--verbose} \
+  ${par_config:+--config "${par_config}"}
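
Review note: the wrapper relies on the `${var:+word}` expansion, which yields `word` only when `var` is set and non-empty. This is why boolean parameters holding `false` are unset first. A minimal sketch of the idiom follows; `build_args` and its two positional inputs are hypothetical names used for illustration and are not part of the component.

```bash
# Hypothetical helper: print one resulting argument per line.
# The body runs in a subshell so its variables do not leak into the caller.
build_args() (
  src="$1"
  verbose="$2"

  # A flag explicitly set to "false" must behave like an unset flag.
  [ "$verbose" = "false" ] && unset verbose

  # ${var:+word} expands to nothing when var is unset or empty.
  # The inner quotes keep a value containing spaces as a single argument.
  set -- ${src:+--source "$src"} ${verbose:+--verbose}

  for arg in "$@"; do
    printf '%s\n' "$arg"
  done
)

build_args "Stringtie v2" "true"
```

The call prints `--source`, `Stringtie v2` and `--verbose` on separate lines; with an empty source and `false` it prints nothing, so the wrapped tool falls back to its own defaults.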
diff --git a/src/agat/agat_convert_genscan2gff/test.sh b/src/agat/agat_convert_genscan2gff/test.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+set -eo pipefail
+
+## VIASH START
+## VIASH END
+
+test_dir="${meta_resources_dir}/test_data"
+
+# create temporary directory and clean up on exit
+TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXX")
+function clean_up {
+ [[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
+}
+trap clean_up EXIT
+
+echo "> Run $meta_name with test data"
+"$meta_executable" \
+  --genscan "$test_dir/test.genscan" \
+  --output "$TMPDIR/output.gff" 
+
+echo ">> Checking output"
+[ ! -f "$TMPDIR/output.gff" ] && echo "Output file output.gff does not exist" && exit 1
+
+echo ">> Check if output is empty"
+[ ! -s "$TMPDIR/output.gff" ] && echo "Output file output.gff is empty" && exit 1
+
+echo ">> Check if output matches expected output"
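+# Run diff inside the condition: under `set -e` a bare failing diff would abort before the message is printed.
+if ! diff "$TMPDIR/output.gff" "$test_dir/agat_convert_genscan2gff_1.gff"; then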
+  echo "Output file output.gff does not match expected output"
+  exit 1
+fi
+
+echo "> Test successful"
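
Review note: with `set -e` active, a failing command run as a bare statement ends the script immediately, so a `$?` check on the following line is never reached. Commands evaluated in an `if` condition are exempt from that rule. A minimal sketch of the pattern follows; `outputs_match` is a hypothetical helper name.

```bash
# Hypothetical helper: compare two files and report the result.
# The body runs in a subshell so `set -e` and `exit` stay local to it.
outputs_match() (
  set -e
  # diff runs as the `if` condition, so a non-zero status does not trigger errexit.
  if ! diff "$1" "$2" > /dev/null; then
    echo "Output file does not match expected output"
    exit 1
  fi
  echo "Output matches"
)
```

Called with two identical files, the helper prints `Output matches` and returns 0; with differing files it prints the error message and returns 1.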
diff --git a/src/agat/agat_convert_genscan2gff/test_data/agat_convert_genscan2gff_1.gff b/src/agat/agat_convert_genscan2gff/test_data/agat_convert_genscan2gff_1.gff
@@ -0,0 +1,25 @@
+##gff-version 3
+unknown	genscan	gene	2223	4605	75.25	+	.	ID=gene_1
+unknown	genscan	mRNA	2223	4605	75.25	+	.	ID=mrna_1;Parent=gene_1
+unknown	genscan	exon	2223	3020	75.25	+	.	ID=exon_1;Parent=mrna_1
+unknown	genscan	exon	4249	4605	13.03	+	.	ID=exon_2;Parent=mrna_1
+unknown	genscan	CDS	2223	3020	75.25	+	0	ID=cds_1;Parent=mrna_1
+unknown	genscan	CDS	4249	4605	13.03	+	0	ID=cds_2;Parent=mrna_1
+unknown	genscan	gene	6829	8789	20.06	-	.	ID=gene_2
+unknown	genscan	mRNA	6829	8789	20.06	-	.	ID=mrna_2;Parent=gene_2
+unknown	genscan	exon	6829	7297	20.06	-	.	ID=exon_3;Parent=mrna_2
+unknown	genscan	exon	7730	7888	12.78	-	.	ID=exon_4;Parent=mrna_2
+unknown	genscan	exon	8029	8185	7.45	-	.	ID=exon_5;Parent=mrna_2
+unknown	genscan	exon	8278	8546	17.45	-	.	ID=exon_6;Parent=mrna_2
+unknown	genscan	exon	8647	8789	18.65	-	.	ID=exon_7;Parent=mrna_2
+unknown	genscan	CDS	6829	7297	20.06	-	1	ID=cds_3;Parent=mrna_2
+unknown	genscan	CDS	7730	7888	12.78	-	1	ID=cds_4;Parent=mrna_2
+unknown	genscan	CDS	8029	8185	7.45	-	2	ID=cds_5;Parent=mrna_2
+unknown	genscan	CDS	8278	8546	17.45	-	1	ID=cds_6;Parent=mrna_2
+unknown	genscan	CDS	8647	8789	18.65	-	0	ID=cds_7;Parent=mrna_2
+unknown	genscan	gene	10209	11924	16.18	+	.	ID=gene_3
+unknown	genscan	mRNA	10209	11924	16.18	+	.	ID=mrna_3;Parent=gene_3
+unknown	genscan	exon	10209	11313	16.18	+	.	ID=exon_8;Parent=mrna_3
+unknown	genscan	exon	11850	11924	3.27	+	.	ID=exon_9;Parent=mrna_3
+unknown	genscan	CDS	10209	11313	16.18	+	0	ID=cds_8;Parent=mrna_3
+unknown	genscan	CDS	11850	11924	3.27	+	2	ID=cds_9;Parent=mrna_3
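
Review note: in the expected GFF3 output, each mRNA, exon and CDS record points to its parent through a `Parent=` attribute in column 9. A quick consistency check for such a file can be sketched as below. `check_parents` is a hypothetical helper, and it assumes that a parent record always appears before its children, as it does in this test file.

```bash
# Hypothetical helper: report records whose Parent= refers to an ID not seen earlier.
check_parents() {
  awk -F'\t' '
    /^#/ { next }                      # skip directives such as ##gff-version
    {
      id = ""; parent = ""
      n = split($9, attrs, ";")        # column 9 holds key=value pairs
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
        if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
      }
      if (id != "") seen[id] = 1
      if (parent != "" && !(parent in seen)) {
        print "orphan: " id
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

The helper exits with status 0 when every `Parent=` resolves and with status 1 after listing the orphaned records otherwise.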
diff --git a/src/agat/agat_convert_genscan2gff/test_data/script.sh b/src/agat/agat_convert_genscan2gff/test_data/script.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+# clone repo
+if [ ! -d /tmp/agat_source ]; then
+  git clone --depth 1 --single-branch --branch master https://github.com/NBISweden/AGAT /tmp/agat_source
+fi
+
+# copy test data
+cp -r /tmp/agat_source/t/scripts_output/in/test.genscan src/agat/agat_convert_genscan2gff/test_data/test.genscan
+cp -r /tmp/agat_source/t/scripts_output/out/agat_convert_genscan2gff_1.gff src/agat/agat_convert_genscan2gff/test_data/agat_convert_genscan2gff_1.gff
+