Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify SAM spec document to better represent restrictions in the SEQ field #800

Open
Psy-Fer opened this issue Oct 31, 2024 · 4 comments
Open
Labels

Comments

@Psy-Fer
Copy link

Psy-Fer commented Oct 31, 2024

Currently, the sam spec document (https://samtools.github.io/hts-specs/SAMv1.pdf) has the following for the mandatory SEQ field

10 SEQ String *|[A-Za-z=.]+ segment SEQuence

image

This doesn't really match up with the IUPAC base restriction within the BAM spec, and so anything that is not IUPAC is converted to N.

This isn't clear in the document.

I propose this should be updated to avoid confusion.

Cheers,
James

@jmarshall jmarshall transferred this issue from samtools/samtools Oct 31, 2024
@jmarshall jmarshall added the sam label Oct 31, 2024
@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 31, 2024

Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes.

I should also note that IUPAC was explicitly designed with U as one of the possible base types. You'll note there are common themes in some of the codes. Weak vs Strong (W/S), Purine and Pyridmine (R, Y), etc. One of themes is not-A (C,G,T), not-C, not-G, not-T. These use the next letter along in the alphabet, so not-A is B, not-C is D, not-G is H and not-T is... V! It could have been U, but that's reserved already as T and U are synonymous (and note there is no IUPAC code for not-U). So right back when IUPAC was being designed it was already accepted that the two were essentially interchangeable with the meta-data on the sequence dictating which is it. This makes sense as both DNA and RNA still only have an alphabet of 4 symbols.

So... we should probably also expand the text in BAM to state that T (or U) is 8.

@jmarshall
Copy link
Member

Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of A-Z must be taken as valid in SAM SEQ because that's what the spec says, I think it's incumbent on them to explain what Z should mean… 😄

@jkbonfield
Copy link
Contributor

jkbonfield commented Oct 31, 2024

Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes.

So picard 2.26 (I can't use v3 as I don't have a modern enough Java available, but I doubt it's changed) with U gives me this:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Invalid character in read bases; File _.sam; Line 2

Setting the validation stringency to silent allows ViewSam to process the file and it's reported back as-is, but then it also allows #, X, and other non-IUPAC characters.

With U in there, the conversion to BAM fails regardless of validation stringency.

java -jar ~/work/picard-2.26.11.jar SamFormatConverter O=_.bam I=_.sam VALIDATION_STRINGENCY=SILENT
Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1

In Htslib all non IUPAC codes are silently converted to N. (Prior to my recent change that included U, which IMO was a mistake as IUPAC does acknowledge U as a real base type.) I haven't checked Noodles on the command line, but I see the BAM encoder doesn't permit U.

What this is suggesting to me is that while the spec may well say one thing, the implementations behave very differently. None (until very recently) would store U in bam AS t and at least two dislike U in SAM. I made the change in htslib because ONT are apparently already producing SAM in the wild with U in it, so something needed to be done to cope with real-world data, but I was remiss in not bringing this up here for spec clarification. Thanks for raising this issue.

My take on it is the spec should be clearer and define precisely the list of characters to something that is compatible with BAM. We should also add a recommendation that U is treated as T to aid parsing of data now being produced, but it may be problematic to make it a rule unless we bump the SAM minor version number (which is also not satisfactory as it's the BAM part of the SAM spec that needs fixing, and that doesn't have anything to do with the SAM VN header field). The issue of how to track the meta-data (ie source material type) is best dealt with over in #801.

@jkbonfield
Copy link
Contributor

Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of A-Z must be taken as valid in SAM SEQ because that's what the spec says, I think it's incumbent on them to explain what Z should mean… 😄

Ironically it was ONT's use of Z in fastq that spurred me on to come up with a formal way of representing methylation in SAM/BAM! They were initially using Z as 5mC. Ghastly and never going to fly beyond a few custom tools.

Anyway, regarding limiting of the regexp I think we're all pretty much on the same page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants