Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupt SAM/BAM file due to NULL injected into MD tag with two-pass #2226

Open
btownshend opened this issue Oct 8, 2024 · 0 comments
Open

Comments

@btownshend
Copy link

I encountered a situation where the output from STAR was not readable by samtools, causing 'samtools view' to give an exception due to "corrupt" aux data:
[E::sam_format1_append] Corrupted aux data for read LH00192:82:22GMCNLT3:1:1128:12373:5198_TCGATGTCGGCTCT
It took a while to track down and to get it to occur in SAM output to be able to see why, but what appears to be happening is that there are some null's embedded in the MD tag causing premature ending of the tag data. This results in parsers assuming that the remaining part of the MD tag is another tag, which turns out to have an invalid type code.
Looking at the SAM line that gives the problem doesn't show it immediately:

2421241:LH00192:82:22GMCNLT3:1:1128:12373:5198_TCGATGTCGGCTCT 83 NT_187594.1 191674 255 104S14M19S = 191674 -14 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTCTCAAAAAAAAAAACGACACCCCAGATGTCTGAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCACCGACAACAACCTTTCTGCTTCACAATGATAGGA IIIIII9III9999I99I9999999999999-------9I9--9-I------9-9-I-I9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NH:i:1 HI:i:1 AS:i:21 nM:i:0 NM:i:3 MD:Z:11000

But looking at a hex dump of that line shows that it ends with:
... 0 \t N M : i : 3 \t M D : Z : 1 1 \0 0 \0 0 \0 0 \n

This only occurs in two-pass mode and I think it has something to do with the alignment occurring at the very end of the chr NT_187594.1 -- it seems to be reading past the end of that template and grabbing some nulls to use in the MD tag.

Note that this is pretty hard to reproduce -- I only saw it on one file that had ~10M reads before hitting this -- shortening the input removed the error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant