-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify SAM spec document to better represent restrictions in the SEQ field #800
Comments
Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes. I should also note that IUPAC was explicitly designed with U as one of the possible base types. You'll note there are common themes in some of the codes. Weak vs Strong (W/S), Purine and Pyridmine (R, Y), etc. One of themes is not-A (C,G,T), not-C, not-G, not-T. These use the next letter along in the alphabet, so not-A is B, not-C is D, not-G is H and not-T is... V! It could have been U, but that's reserved already as T and U are synonymous (and note there is no IUPAC code for not-U). So right back when IUPAC was being designed it was already accepted that the two were essentially interchangeable with the meta-data on the sequence dictating which is it. This makes sense as both DNA and RNA still only have an alphabet of 4 symbols. So... we should probably also expand the text in BAM to state that T (or U) is 8. |
Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of |
So picard 2.26 (I can't use v3 as I don't have a modern enough Java available, but I doubt it's changed) with U gives me this:
Setting the validation stringency to silent allows ViewSam to process the file and it's reported back as-is, but then it also allows #, X, and other non-IUPAC characters. With U in there, the conversion to BAM fails regardless of validation stringency.
In Htslib all non IUPAC codes are silently converted to N. (Prior to my recent change that included U, which IMO was a mistake as IUPAC does acknowledge U as a real base type.) I haven't checked Noodles on the command line, but I see the BAM encoder doesn't permit U. What this is suggesting to me is that while the spec may well say one thing, the implementations behave very differently. None (until very recently) would store U in bam AS t and at least two dislike U in SAM. I made the change in htslib because ONT are apparently already producing SAM in the wild with U in it, so something needed to be done to cope with real-world data, but I was remiss in not bringing this up here for spec clarification. Thanks for raising this issue. My take on it is the spec should be clearer and define precisely the list of characters to something that is compatible with BAM. We should also add a recommendation that U is treated as T to aid parsing of data now being produced, but it may be problematic to make it a rule unless we bump the SAM minor version number (which is also not satisfactory as it's the BAM part of the SAM spec that needs fixing, and that doesn't have anything to do with the SAM |
Ironically it was ONT's use of Z in fastq that spurred me on to come up with a formal way of representing methylation in SAM/BAM! They were initially using Z as 5mC. Ghastly and never going to fly beyond a few custom tools. Anyway, regarding limiting of the regexp I think we're all pretty much on the same page. |
Currently, the sam spec document (https://samtools.github.io/hts-specs/SAMv1.pdf) has the following for the mandatory SEQ field
10 SEQ String *|[A-Za-z=.]+ segment SEQuence
This doesn't really match up with the IUPAC base restriction within the BAM spec, and so anything that is not IUPAC is converted to N.
This isn't clear in the document.
I propose this should be updated to avoid confusion.
Cheers,
James
The text was updated successfully, but these errors were encountered: