-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for reading SAM/BAM files with contigs larger than max int #1788
Comments
For the record, bam indexing seems to break at 2^29 (samtools/hts-specs#35). But the length requested by the original post is very very large (it's almost the size of the entire human reference....just one contig!) |
Thank you Yossi. |
The BAI format is limited to 229 due to its unchangeable minshift=14 / depth=5 geometry. This limitation is not inherent in the indexing scheme, and CSI can use geometries that allow maximum coordinates larger than 232 to be indexed. IMHO the 32-bit ranges for LN / POS / PNEXT / TLEN that Yossi quoted from the SAM spec are a bit misleading. In BAM these fields are signed 32 bits so the values are indeed limited to 231 in BAM, but in SAM these fields are just text strings of digits so there is no inherent reason for them to be limited in range at all. I'm not sure whether these fields are limited to 32 bits in CRAM. So really these 32-bit ranges only necessarily apply to BAM\1 (see samtools/hts-specs#240). So an implementer could choose to represent chromosome positions internally as e.g. 64 bits, read and write them with their full range in SAM and (perhaps) CRAM, and raise an error if an attempt is made to write a position greater than 231 as BAM. HTSlib does this since samtools/htslib#709, which landed in 1.10. (On reading further, I think current CRAM is also limited to 231 but CRAMv4 will in future lift this to 64 bits.) |
Hi, @yfarjoun |
Firstly: this is completely experimental!!!! I have not done this!!! Someone else may have, but not I. I do know that something like this was done for the circular mitochondrial genome, for very different reasons. There's no tool (that I'm aware of) that will do this, here's what I would do if I had to do this:
|
Hi, @yfarjoun |
Hello, Could you please tell me if contig size is still limited for running Picard MarkDuplicates? I work mainly with amphibian genomes, which tend to have large chromosomes. Recently, when I tried to run Picard MarkDuplicates, I got the following error: [Tue Jun 06 11:46:58 AEST 2023] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.02 minutes. I've read several older threads that discuss contig size limits. One said that contigs must be <512MB and another that sam contig size limit is 2^31. The largest contig in my dataset exceeds the Byte limit, but not the sam contig size limit. This contig is 1,251,518,389bp and 1,272MB. Thank you, |
Instructions
In #1783 a user described a situation in which MarkDuplicates failed because the size of the contig in their BAM was larger than the maximum value of an int (the value in question was 2364278061).
This ticket is to investigate modifications to htsjdk and picard to add support for increasing the allowable size for a contig to be greater than an int.
Bug Report
Affected tool(s)
Tool name(s), special parameters?
Affected version(s)
Description
Describe the problem below. Provide screenshots , stacktrace , logs where appropriate.
Steps to reproduce
Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem and provide a minimal test case.
Expected behavior
Tell us what should happen
Actual behavior
Tell us what happens instead
Feature request
Tool(s) involved
Tool name(s), special parameters?
Description
Specify whether you want a modification of an existing behavior or addition of a new capability.
Provide examples, screenshots, where appropriate.
Documentation request
Tool(s) involved
Tool name(s), parameters?
Description
Describe what needs to be added or modified.
The text was updated successfully, but these errors were encountered: