Modify SAM spec document to better represent restrictions in the SEQ field #800

Psy-Fer · 2024-10-31T03:47:13Z

Currently, the sam spec document (https://samtools.github.io/hts-specs/SAMv1.pdf) has the following for the mandatory SEQ field

10 SEQ String *|[A-Za-z=.]+ segment SEQuence

This doesn't really match up with the IUPAC base restriction within the BAM spec, and so anything that is not IUPAC is converted to N.

This isn't clear in the document.

I propose this should be updated to avoid confusion.

Cheers,
James

jkbonfield · 2024-10-31T09:26:13Z

Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes.

I should also note that IUPAC was explicitly designed with U as one of the possible base types. You'll note there are common themes in some of the codes. Weak vs Strong (W/S), Purine and Pyridmine (R, Y), etc. One of themes is not-A (C,G,T), not-C, not-G, not-T. These use the next letter along in the alphabet, so not-A is B, not-C is D, not-G is H and not-T is... V! It could have been U, but that's reserved already as T and U are synonymous (and note there is no IUPAC code for not-U). So right back when IUPAC was being designed it was already accepted that the two were essentially interchangeable with the meta-data on the sequence dictating which is it. This makes sense as both DNA and RNA still only have an alphabet of 4 symbols.

So... we should probably also expand the text in BAM to state that T (or U) is 8.

jmarshall · 2024-10-31T09:48:57Z

Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of A-Z must be taken as valid in SAM SEQ because that's what the spec says, I think it's incumbent on them to explain what Z should mean… 😄

jkbonfield · 2024-10-31T09:52:59Z

Agreed. I think it was rather lazy to avoid spelling out the exact list of accepted characters. It would be interesting to test what other SAM parsers do when it comes to non-IUPAC codes.

So picard 2.26 (I can't use v3 as I don't have a modern enough Java available, but I doubt it's changed) with U gives me this:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. Invalid character in read bases; File _.sam; Line 2

Setting the validation stringency to silent allows ViewSam to process the file and it's reported back as-is, but then it also allows #, X, and other non-IUPAC characters.

With U in there, the conversion to BAM fails regardless of validation stringency.

java -jar ~/work/picard-2.26.11.jar SamFormatConverter O=_.bam I=_.sam VALIDATION_STRINGENCY=SILENT
Exception in thread "main" java.lang.IllegalStateException: Bad base passed to charToCompressedBaseLow: #(35) in read: r1

In Htslib all non IUPAC codes are silently converted to N. (Prior to my recent change that included U, which IMO was a mistake as IUPAC does acknowledge U as a real base type.) I haven't checked Noodles on the command line, but I see the BAM encoder doesn't permit U.

What this is suggesting to me is that while the spec may well say one thing, the implementations behave very differently. None (until very recently) would store U in bam AS t and at least two dislike U in SAM. I made the change in htslib because ONT are apparently already producing SAM in the wild with U in it, so something needed to be done to cope with real-world data, but I was remiss in not bringing this up here for spec clarification. Thanks for raising this issue.

My take on it is the spec should be clearer and define precisely the list of characters to something that is compatible with BAM. We should also add a recommendation that U is treated as T to aid parsing of data now being produced, but it may be problematic to make it a rule unless we bump the SAM minor version number (which is also not satisfactory as it's the BAM part of the SAM spec that needs fixing, and that doesn't have anything to do with the SAM VN header field). The issue of how to track the meta-data (ie source material type) is best dealt with over in #801.

jkbonfield · 2024-10-31T10:01:28Z

Context is this thread. I also suspect this was a simple regex chosen for its brevity rather than its accuracy. If someone does want to claim that all of A-Z must be taken as valid in SAM SEQ because that's what the spec says, I think it's incumbent on them to explain what Z should mean… 😄

Ironically it was ONT's use of Z in fastq that spurred me on to come up with a formal way of representing methylation in SAM/BAM! They were initially using Z as 5mC. Ghastly and never going to fly beyond a few custom tools.

Anyway, regarding limiting of the regexp I think we're all pretty much on the same page.

jmarshall transferred this issue from samtools/samtools Oct 31, 2024

jmarshall added the sam label Oct 31, 2024

jkbonfield mentioned this issue Oct 31, 2024

Represent U in SAM rather than T for reads from RNA #801

Open

This was referenced Oct 31, 2024

Accept U as a valid base type and convert to T for BAM. samtools/htsjdk#1728

Open

Accept U as a base type and convert to T for BAM. zaeleus/noodles#306

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify SAM spec document to better represent restrictions in the SEQ field #800

Modify SAM spec document to better represent restrictions in the SEQ field #800

Psy-Fer commented Oct 31, 2024

jkbonfield commented Oct 31, 2024 •

edited

Loading

jmarshall commented Oct 31, 2024

jkbonfield commented Oct 31, 2024 •

edited

Loading

jkbonfield commented Oct 31, 2024

Modify SAM spec document to better represent restrictions in the SEQ field #800

Modify SAM spec document to better represent restrictions in the SEQ field #800

Comments

Psy-Fer commented Oct 31, 2024

jkbonfield commented Oct 31, 2024 • edited Loading

jmarshall commented Oct 31, 2024

jkbonfield commented Oct 31, 2024 • edited Loading

jkbonfield commented Oct 31, 2024

jkbonfield commented Oct 31, 2024 •

edited

Loading

jkbonfield commented Oct 31, 2024 •

edited

Loading