Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict allowed VCF Contig ID chars the same way as SAM RNAME (and allow colons) #379

Merged
merged 2 commits into from
Mar 3, 2020

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Feb 5, 2019

Suggested starting point for aligning the VCF contig ID rules with January's changes to punctuation characters allowed in SAM reference sequence names (see PR #333).

  1. The first commit removes the prohibition on colons on the basis that there is in fact no problem with parsing breakends. If this was the only colon problem suspected (and it seems Problems with allowed contig names in the SAM spec #124, Existing VCF spec is incompatible with certain hg38 contig names #258, and Resolving ambiguities introduced when supporting colons in hg38 names #291 have found no others), it should be uncontroversial to relax this in all VCF versions.

  2. The second commit makes the VCF contig ID rules the same as the new SAM RNAME rules. This has several effects:

    • it makes * valid when it is not the first character in the ID. (Some HLA alleles contain such non-first * characters.) I believe the relevant symbolic alleles we're trying to avoid clashes with are <*> and/or * so relaxing to all non-first * doesn't affect that clash avoidance.

    • Making comma (,) invalid fixes New Restrictions on Contig Names not Strict Enough #167.

    • It newly disallows \ , "'`` () {} characters in contig names. This is the only potentially controversial part of this change, as it deems invalid things that were previously valid.

      But this may not be a problem in practice: the SAM statistics previously collected showed that restricting these particular characters was no problem for SAM in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, these statistics are reassuring; or you may wish to collect VCF statistics.

This is a starting point. Among things the VCF maintainers may wish to consider adding are:

  • Re colons, you may wish to refer readers to the SAM spec's appendix on parsing name:beg-end region notation in the presence of colon-containing names. Or you may wish to include such an appendix of your own.

  • I tend to think of the CHROM column as coming from a ##contig header, but perhaps that is not always the case. So perhaps VCFv4.3 §1.6.1 Fixed fields's CHROM entry should mention that this identifier must also always match §1.4.7 Contig field format's regular expression.

  • In §5.4 Breakends, you may wish to spell out explicitly that the final colon in chr:pos is to be parsed as the mate breakend part's colon, never part of a contig ID (cf Existing VCF spec is incompatible with certain hg38 contig names #258 (comment) and especially Existing VCF spec is incompatible with certain hg38 contig names #258 (comment)).

  • Other further descriptive text.

  • Whether to induce a VCF version number change with these changes. For SAM, as the restrictions on backslashes, commas, quotes, and brackets made no difference to 99.9%+ of files observed in the wild, we decided no new format version was warranted.

VCFv4.1.tex Outdated Show resolved Hide resolved
@hts-specs-bot
Copy link

Changed PDFs as of 857fab8: VCFv4.1 (diff), VCFv4.2 (diff).

@samtools samtools deleted a comment from hts-specs-bot May 29, 2019
@pd3
Copy link
Member

pd3 commented Jun 4, 2019

+1

@cyenyxe
Copy link
Member

cyenyxe commented Jun 24, 2019

The part of this PR that allows colons is backwards compatible, but the further restrictions on contig names are not. I am totally in favor of introducing this change, but are we happy to do so in VCF v4.3? We have observed that it doesn't affect a big number of sequences.

Breakend notation always includes a ":pos" part, so breakends are
unambiguous even if the "chr" in "chr:pos" also itself contains colons.

As this is a relaxation of the previous rules, there is no concern
about altering all three 4.1/4.2/4.3 specs.

Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of merging this into 4.3. It's technically a breaking change, but I think it's a necessary one and one that the majority is already suffering with.

@jmmut jmmut merged commit 0ef79c0 into samtools:master Mar 3, 2020
@jmarshall
Copy link
Member Author

Thanks for merging this, but please remember to take care about when to squash commits or not. This PR had one uncontroversial allowing-colons commit and one more-concerning restricting-punctuation commit, so per MAINTAINERS.md would ideally have been kept as separate changes.

@jmmut
Copy link
Contributor

jmmut commented Mar 3, 2020

Ok. I did the squash because they seemed close enough to me (and there was unanimous agreement on both), and to avoid rebasing to your branch and avoiding the merge commit. I will lower the threshold of "different changes" in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New Restrictions on Contig Names not Strict Enough
7 participants