-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restrict allowed VCF Contig ID chars the same way as SAM RNAME (and allow colons) #379
Conversation
7ca0409
to
cd869bf
Compare
cd869bf
to
857fab8
Compare
+1 |
The part of this PR that allows colons is backwards compatible, but the further restrictions on contig names are not. I am totally in favor of introducing this change, but are we happy to do so in VCF v4.3? We have observed that it doesn't affect a big number of sequences. |
Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of samtools#124. Fixes samtools#258. Closes samtools#291.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
857fab8
to
df03971
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of merging this into 4.3. It's technically a breaking change, but I think it's a necessary one and one that the majority is already suffering with.
Thanks for merging this, but please remember to take care about when to squash commits or not. This PR had one uncontroversial allowing-colons commit and one more-concerning restricting-punctuation commit, so per MAINTAINERS.md would ideally have been kept as separate changes. |
Ok. I did the squash because they seemed close enough to me (and there was unanimous agreement on both), and to avoid rebasing to your branch and avoiding the merge commit. I will lower the threshold of "different changes" in the future. |
Suggested starting point for aligning the VCF contig ID rules with January's changes to punctuation characters allowed in SAM reference sequence names (see PR #333).
The first commit removes the prohibition on colons on the basis that there is in fact no problem with parsing breakends. If this was the only colon problem suspected (and it seems Problems with allowed contig names in the SAM spec #124, Existing VCF spec is incompatible with certain hg38 contig names #258, and Resolving ambiguities introduced when supporting colons in hg38 names #291 have found no others), it should be uncontroversial to relax this in all VCF versions.
The second commit makes the VCF contig ID rules the same as the new SAM RNAME rules. This has several effects:
it makes
*
valid when it is not the first character in the ID. (Some HLA alleles contain such non-first*
characters.) I believe the relevant symbolic alleles we're trying to avoid clashes with are<*>
and/or*
so relaxing to all non-first*
doesn't affect that clash avoidance.Making comma (
,
) invalid fixes New Restrictions on Contig Names not Strict Enough #167.It newly disallows
\ , "'`` () {}
characters in contig names. This is the only potentially controversial part of this change, as it deems invalid things that were previously valid.But this may not be a problem in practice: the SAM statistics previously collected showed that restricting these particular characters was no problem for SAM in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, these statistics are reassuring; or you may wish to collect VCF statistics.
This is a starting point. Among things the VCF maintainers may wish to consider adding are:
Re colons, you may wish to refer readers to the SAM spec's appendix on parsing
name:beg-end
region notation in the presence of colon-containing names. Or you may wish to include such an appendix of your own.I tend to think of the CHROM column as coming from a
##contig
header, but perhaps that is not always the case. So perhaps VCFv4.3 §1.6.1 Fixed fields's CHROM entry should mention that this identifier must also always match §1.4.7 Contig field format's regular expression.In §5.4 Breakends, you may wish to spell out explicitly that the final colon in
chr:pos
is to be parsed as the mate breakend part's colon, never part of a contig ID (cf Existing VCF spec is incompatible with certain hg38 contig names #258 (comment) and especially Existing VCF spec is incompatible with certain hg38 contig names #258 (comment)).Other further descriptive text.
Whether to induce a VCF version number change with these changes. For SAM, as the restrictions on backslashes, commas, quotes, and brackets made no difference to 99.9%+ of files observed in the wild, we decided no new format version was warranted.