-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disallow commas and other punctuation in RNAME etc #333
Conversation
d131782
to
16cff42
Compare
I'm assuming that What will the folks that have |
I mean to re-collect the statistics on this PR, but IIRC all the commas observed in the various repositories are from one malformatted file, as noted in #291 (comment). So such folks have not been observed to exist. |
16cff42
to
f306c75
Compare
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Add previously omitted SQ-AN history note.
f306c75
to
8c27bc4
Compare
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Add previously omitted SQ-AN history note.
Previous discussion of punctuation characters in RNAMEs can be found in #124 (HLA and colons), #167 (VCF and commas/equals), #220 (SQ-AN; statistics), #258 (VCF and HLA/colons), #291 (VCF breakends; statistics). This table collects the statistics of punctuation character counts from various archives of reference FASTA files:
Re commas, @daviesrob reported that all the commas he observed originate from one malformatted file. @yfarjoun, @cyenyxe: Is something similar the case for the other two repositories? |
I was previously thinking it was important that implementations shouldn't need to enforce several different subsets of characters, and therefore that we shouldn't mark this change via HD-VN. Now I don't think that's nuanced enough. Based on conversation at last month's meeting, the considerations are:
So implementations should accept commas etc for HD-VN≤1.5 but diagnose commas etc for HD-VN≥1.6. (Assuming that they essentially copy VN when copying records around but emit the current spec VN when creating new data files.) It would be up to the implementation whether this means silence for ≤ 1.5 and a warning for ≥ 1.6, or warning≤1.5 and error≥1.6, or etc. (The SQ-AN subset was more restricted prior to this PR, but I would suggest that there is no point enforcing this for ≤1.5.) If a ≤1.5 file has a Probably VN:1.6 is rare enough in the wild that using it to signal the intention to comply with this restricted RNAME subset is fine. Or we might perhaps want to bump the spec to VN:1.7 to signal this. If others agree with this ≤1.5 / ≥1.6 approach, I'll add text to this effect to the history appendix entry. Consideration (1) means that we have only this one chance to get the subset selection right! |
8c27bc4
to
7505297
Compare
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Add previously omitted SQ-AN history note.
I'm OK with outlawing In the end it's only 3 references that have them, and i think that we can deal. I'd like to note that * is in hg38...so I think the ship has sailed FAR with that one. |
Do we need to extend the regexp notation? My interpretation of POSIX is we'd normally use things like This in turn means if we chose the left hand side of the rname as |
I like the inclusion of appendix A - it's not specification, but guidance on tool chain usage which feels appropriate. Do we also want to expand on potential disambiguation techniques? Eg given we're outlawing curly braces, we could also permit {rname}:1-2 to be equivalent to rname:1-2, thus permitting |
I like the proposed hint for disambiguation. it should probably be put into the spec so that different tools can all use it. |
For completeness here, if I recall in the last File Formats meeting @jmarshall explained the logic for a new regex syntax. My understanding is it is a preference to define the characters that may be used, with an exception for the first character, rather than swapping these around purely for the convenience of sticking with standard regexps. Really this is just a case of regexp being a poor choice compared to a grammar. I'm not entirely convinced with the argument, but I can see it has some benefit too so I'll accept it if others are like minded. Apart from that foible, I'm happy with this PR. |
I think in the previous meeting we agreed with #333 (comment) so I'll add a bit talking about VN foibles. And that I'll expand the appendix to suggest a convention of The thing about regexp syntax comes down to clarity of exposition. For me, the downside of inventing our own tiny piece of ad hoc regexp-related syntax is outweighed by upside of being able to express the set of rname characters directly:
I think it's clear that “we write this set of characters” refers to the set described in (1), and even if that's not clear to a particular reader the meaning is made clear by having (2) there too. I tried it both ways around (and several other ways too) and I believe that this way leads to the clearest exposition, which is why it's in the proposal. James's understanding described in the previous comment is correct; people who remain displeased by the idea of inventing ad hoc regexp syntax should workshop this themselves and see if they can come up with something clearer. 😄 |
While we are here...should we protect the sequence name |
Hmm, that feels a bit too implementation specific! I was wondering though recently about classes of names that we may wish to reserve for specific purposes. I'd like a way of inventing reference names, eg by crude hash based or minimiser based denovo assembly of unmapped data. These would need entries in the SQ headers too, and have to be named. Eg |
given that htsget is currently grappling with how to access unplaced reads, I don't think that it is implementation specific.... #320 |
This is a question about in-band vs. out-of-band signalling (i.e., do we allow for special values that are syntactically the same as normal values). In programming languages, this is called reserved words vs. identifiers, and the usual approaches are grandfathering in the list of reserved words right from the start or adding new reserved words by using distinguishing syntax (e.g. a leading IMHO it's very late in the day to be thinking about adding some reserved words to our RNAME rules. Particularly as SAM has always reserved all RNAMEs starting with As for htsget:— there has been a discussion about whether to use SAM's already distinguished-syntax |
Technically SAM doesn't reserve words starting with "*" or "=", but words that are entirely "*" or "=". I think "implementation defined" is a good way of dealing with them too. So we can say "*" and "=" have reserved meanings, anything starting "*" or "=" is implementation dependent, and everyone else you have to honour as valid user data. I've given a use case I hope to put into practice already here - fake references for data clustering purposes. |
The current spec has long said that You are correct that the RNAME and RNEXT fields assign particular meanings to words that are in their entirety "*" or "=". However due to that regexp the spec also reserves all RNAMEs starting with |
Ah yes, you're correct. I hadn't spotted * was missing from the RNAME regex. This is inherently the problem of having negated regexps done this way; it's hard to remember my ASCII collating order to know what is between ) and +. The new regexp is a huge improvement in this regard too. :-) |
* implementing the spec change in samtools/hts-specs#333 * this disallows a number of characters from reference sequence names, these are characters like ',' which do not appear in standard references and which cause parsing issues if they are allowed
I've just pushed a draft of those two things. I think we're all in agreement on the basic thrust of this PR, so once that's reviewed and wordsmithed, I think this is ready to merge! It wasn't that easy to compose something useful to say about the VN foibles and note that contrary to the suggested text in this latest commit, samtools/htsjdk#1238 as merged strictly enforces the RNAME regexp regardless of the input file's @HD-VN value. Hmmm… |
@jkbonfield @yfarjoun — please review this last commit about {rname}:1-2 delimiter syntax and VN. If we agree with adding this to VN:1.6 (rather than bumping to a new VN:1.7), I think this is ready for merging. |
I'm fine with it and if I recall the bracing style matches the in the languishing samtools/htslib#708 PR. I don't think we need a version number bump (but I wouldn't feel overly concerned if we decided it warranted it). |
e38e9b0
to
f36d071
Compare
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
Rebased on to current master. It appears that we've never really particularly discussed the suggested candidates for punctuation to disallow! After removing the characters disallowed by this pull request, we are left with the following accepted RNAME punctuation characters:
Of the dubious ones only There would be something to be gained from disallowing some of these dubious characters ( So I mention this for completeness, and propose to merge this as is, with the somewhat clear and memorable rule being: backslash-comma-quotes-brackets forbidden… |
It's a great step forward, and if we later rethink it further it can be a separate PR. However realistically I'm not so worried about shell and regexp metacharacters as they can be escaped. The main concern we had was actual functionality of the specification being broken - eg commas being used elsewhere in strings that contain names, angle brackets in VCF contig names, etc. Besides, if we forbid ? then how am I going to get my ANSI trigraphs through now? ;-) |
It is indeed a great step forward. I really don't want to restrict it further in future though, as that wouldn't be fair to users or to implementations (#333 (comment) consideration (1)). |
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence names. Commas and angle brackets are used to delimit refnames in other SAM fields (e.g. SA) and in VCF files, and restricting these other characters facilitates future delimiter and quoting syntax. Statistics gathered from various reference sequence archives suggest that these characters appear vanishingly infrequently in refnames in existing files in the wild. Fixes the SAM aspects of samtools#124, samtools#167, samtools#258, and samtools#291. Add appendix describing parsing `name:beg-end` when name allows colons: pseudocode description of algorithm to detect ambiguous input, as proposed in a comment on samtools#124; suggest also accepting an alternative `{name}:beg-end` delimited notation. Add previously omitted SQ-AN history note.
f36d071
to
7fd686b
Compare
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
….file to sanitize fasta file IDs fasta ID is sanitized for picard CreateSequenceDictionary to adhere to character set restrictions in SAM/BAM RNAME spec. see: samtools/hts-specs#333 Roll back test input now that we are doing the sanitization. As of this commit, sanitization is only performed on fasta files destined for CreateSequenceDictionary
* update picard 2.18.11 -> 2.20.3 * remove certain characters from unit test input fasta that are disallowed by picard commas, single quotes, etc. * compare only reads in sam post-depletion compare only reads in sam post-depletion, allowing us to ignore header differences due to varying versions of picard/samtools/htslib; remove related permutations of expected output from tests * update expected output for depletion db build tests * use regex to allow for flexible header comparison in picard test * In TestFastqBam, resolve longstanding brittle test case add new test assertion, "assertEqualSamHeaders" to address this inline ToDo for TestFastqBam(): "Note for developers: if you're fixing the tests to handle non-bugs (ie our testing here is too brittle), let's just replace a lot of this in the future with code that just reads the header, sorts it, and tests for equality of sorted values in the RG line (and stricter equality in the non-header lines). This is kind of hacky." * add debug statements to install-viral-ngs.sh script add debug statements to install-viral-ngs.sh Docker script * install viral-ngs dependencies to separate conda env from base env * let's try conda 4.7.10 * unpin conda-build and jinja2 * bump conda-build * do not install gcc * pin CONDA_R=3.4.1 * conda build --debug * let's try pinning R to the latest since something is pulling it in? * remove --debug from conda-build * bump r version required by GATK 3.8 conda package * r-base * r 3.5.1 * matplotlib 2.2.0 -> 2.2.4 * bedtools 2.27.1 -> 2.28.0 * blast 2.6.0 -> 2.7.1 * roll back test input to state of origin/master; add functions to util.file to sanitize fasta file IDs fasta ID is sanitized for picard CreateSequenceDictionary to adhere to character set restrictions in SAM/BAM RNAME spec. see: samtools/hts-specs#333 Roll back test input now that we are doing the sanitization. As of this commit, sanitization is only performed on fasta files destined for CreateSequenceDictionary * switch to lz4-bin from conda-forge change lz4 dependency from lz4-bin from bioconda to lz4-c from conda-forge, v131->v1.9.1 * picard 2.20.3 -> 2.20.5 * krona 2.7 -> 2.7.1 * override channels for tool install, each time * switch to picard-slim package switch to picard-slim package (sans 'r'); specify USE_JDK_DEFLATER=true and USE_JDK_INFLATER=true until bug in Intel deflator is fixed to prevent sporadic crashes * update temp suffix for sanitized fastas * revert to corrected test input revert to corrected test input
Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously fixed by PR samtools#333.
…llow colons) (#379) * Allow colons in VCF Contig IDs: breakend notation is unambiguous Breakend notation always includes a ":pos" part, so breakends are unambiguous even if the "chr" in "chr:pos" also itself contains colons. As this is a relaxation of the previous rules, there is no concern about altering all three 4.1/4.2/4.3 specs. Fixes the VCF/colon aspects of #124. Fixes #258. Closes #291. * Restrict allowed VCF Contig ID chars to those allowed in SAM RNAMEs Disallow \ , "`' (){} punctuation characters in VCF contig IDs. The characters []<> were already disallowed in VCF; this also relaxes the prohibition of * to merely disallowing initial *. Statistics gathered from various reference sequence archives suggest that the characters restricted appear vanishingly infrequently in SAM reference sequence names in existing files in the wild. To the extent that all contig IDs in VCF files come from corresponding SAM/BAM files, this means there is little concern about making the same restrictions in VCF contig IDs. Fixes #124 and fixes #167 for VCF; their SAM aspects were previously fixed by PR #333.
The suggestion here is to disallow
\ , "'`` () <> [] {}
, i.e., backslashes, commas, various quote marks, and various brackets. Hopefully this is consistent with the statistics listed in #291. Additional counts for some of these characters would be useful — see #333 (comment) below.The existing spec uses regular expressions containing ASCII ranges, which is fairly opaque. This PR lists the punctuation characters explicitly and then invents some extra character class notation to refer to the set of allowed characters succinctly and hopefully clearly: