Skip to content

collapse

Adam English edited this page Aug 18, 2021 · 21 revisions

truvari collapse

Will collapse all variants within sizemin/max that match over thresholds. All variants outside size boundaries will be placed into the output.

Collapse will iterate over a single VCF and 'collapse' variants that match one another above the provided thresholds. This is useful when removing potentially redundant calls or merging calls between samples/software.

Merging SVs is a tricky business. Most approaches will simply use distances or reciprocal overlap of calls to merge them. There may be a more informed way to merge calls, especially when they're sequence-resolved breakpoint exact calls. This command helps do a more sophisticated merge.

To start, we can merge multiple VCFs (each with their own sample) using bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf. This will paste SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. For example:

>> one.vcf:
chr1 1 ... GT 0/1
chr1 5 ... GT 1/1
>> two.vcf:
chr1 1 ... GT 1/1
chr1 7 ... GT 0/1

Creates:

>> merge.vcf:
chr1 1 ... GT 0/1 1/1
chr1 5 ... GT 1/1 ./.
chr1 7 ... GT ./. 0/1    

This single VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. For example, if we collapsed the above example merge.vcf by matching any calls within 3bp, we'd create:

>> merge.collapse.vcf
chr1 1 ... GT 0/1 1/1
chr1 5 ... GT 1/1 0/1

When collapsing, the first variant from a matching set of variants will be written to the output while the others will be placed in the collapsed output.

Samples with no genotype information in the first variant will be filled by the first collapsed variant containing genotype information.

When using --hap, we assume phased variants from a single individual. Only the single best matching call from the other haplotype will be collapsed, and the consolidated genotype will become 1/1

For example, if we collapse anything at the same position:

chr1 1 .. GT 0|1
chr1 1 .. GT 1|0
chr1 2 .. GT 1|0

will become:

chr1 1 .. GT 1/1
chr1 2 .. GT 1|0

When using --chain mode, instead of collapsing all variants matching the first variant together, we'll collapse all variants in a matching set together. For example, if we have

chr1 5 ..
chr1 7 ..
chr1 9 ..

When we collapse anything within 2bp of each other, without --chain, we output:

chr1 5 ..
chr1 9 ..

With --chain, we would collapse chr1 9 as well, producing

chr1 5 ..
usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE]
                [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM]
                [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap]
                [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN]
                [-S SIZEMAX] [--passonly]

Structural variant collapser

Will collapse all variants within sizemin/max that match over thresholds
All variants outside size boundaries will be placed into the output

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Comparison set of calls
  -o OUTPUT, --output OUTPUT
                        Output vcf (stdout)
  -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT
                        Where collapsed variants are written (collapsed.vcf)
  -f REFERENCE, --reference REFERENCE
                        Indexed fasta used to call variants
  -k {first,maxqual,common}, --keep {first,maxqual,common}
                        When collapsing calls, which one to keep (first)
  --debug               Verbose logging
  --hap                 Collapsing a single individual's haplotype resolved
                        calls (False)
  --chain               Chain comparisons to extend possible collapsing
                        (False)
  --null-consolidate NULL_CONSOLIDATE
                        Comma separated list of FORMAT fields to consolidate
                        into the kept entry by taking the first non-null from
                        all neighbors (None)

Comparison Threshold Arguments:
  -r REFDIST, --refdist REFDIST
                        Max reference location distance (500)
  -p PCTSIM, --pctsim PCTSIM
                        Min percent allele sequence similarity. Set to 0 to
                        ignore. (0.95)
  -B BUFFER, --buffer BUFFER
                        Percent of the reference span to buffer the haplotype
                        sequence created
  -P PCTSIZE, --pctsize PCTSIZE
                        Min pct allele size similarity (minvarsize/maxvarsize)
                        (0.95)
  -O PCTOVL, --pctovl PCTOVL
                        Minimum pct reciprocal overlap (0.0) for DEL events
  -t, --typeignore      Variant types don't need to match to compare (False)
  --use-lev             Use the Levenshtein distance ratio instead of edlib
                        editDistance ratio (False)

Filtering Arguments:
  -s SIZEMIN, --sizemin SIZEMIN
                        Minimum variant size to consider for comparison (50)
  -S SIZEMAX, --sizemax SIZEMAX
                        Maximum variant size to consider for comparison
                        (50000)
  --passonly            Only consider calls with FILTER == PASS
Clone this wiki locally