-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
64-bit reference positions - ABI/API placeholder #709
Conversation
Would it be worth introducing a typedef — |
I was thinking that while writing this infact, but sadly after I was half way through and lacked will power to go back and change it all! I do think it's probably the right thing to do though, with associated PRIhts_pos macro for printing. |
dc5449d
to
0fc857f
Compare
TODO:
|
Rebased with additions to make headers work. Notes:
We need to test against some htslib-using packages to check that they still work correctly when built against this branch. I've tried samtools and bcftools, which work bar the items noted above. I've also tried pysam, which manages fairly well (it builds OK, although a few tests fail due to unrelated issues with illegal headers in a couple of test files). |
// also INT_MAX instead of -1. This avoids bugs with old code | ||
// using the new hts_pos_t data type. | ||
#define HTS_POS_MAX ((((int64_t)INT_MAX)<<32)|INT_MAX) | ||
#define HTS_POS_MIN INT64_MIN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the same issue with INT64_MIN
being reduced to 0 when cast into a 32-bit int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think HTS_POS_MIN
is there for completeness. It isn't actually returned by any functions in HTSlib, so you would only use it in code upgraded to work with type hts_pos_t
. It's also only meaningful for fields like isize
that could legitimately take a negative value.
if (hrecs->ref[i].len < UINT32_MAX) { | ||
bh->target_len[i] = hrecs->ref[i].len; | ||
} else { | ||
bh->target_len[i] = UINT32_MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't sdict
be updated at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't need to because sam_hdr_tid2len()
uses the sam_hrecs_t::ref
array if available. So if target_len
is being updated from sam_hrecs_t::ref
, sdict
isn't going to be used any more and arguably it could actually be freed at this point.
htslib/hts.h
Outdated
typedef struct { | ||
uint32_t beg, end; | ||
//uint32_t beg, end; | ||
hts_pos_t beg, end; // sorry for the bad naming: FIXME! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hts_pair32_t
could be deprecated and its few uses replaced with hts_pair_pos_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly yes, although it's exposed in hts_reglist_t::intervals
and I see that samtools allocates one in bed_reglist()
.
Possibly we should make hts_pair32_t
a typedef of hts_pair_pos_t
.
@@ -1290,6 +1306,8 @@ typedef struct __bam_mplp_t *bam_mplp_t; | |||
int bam_plp_push(bam_plp_t iter, const bam1_t *b); | |||
const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp); | |||
const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp); | |||
const bam_pileup1_t *bam_plp64_next(bam_plp_t iter, int *_tid, hts_pos_t *_pos, int *_n_plp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't bam_pileup1_t.qpos
(line 1283) be changed from int32_t
to hts_pos_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qpos
is on the read, not the reference. We're not attempting to support reads longer than 2Gbases at the moment.
3b905bc
to
f44aac1
Compare
Rebased with more additions:
Now is the time to seriously kick the tyres on this one, especially the VCF updates. It's looking ready to be merged, so if anyone has comments it would be best to add them sooner rather than later. See also samtools/samtools#1107 and samtools/bcftools#1083 which need to be merged before this PR so that they continue to work properly. |
sam.c
Outdated
@@ -3960,7 +4018,7 @@ void bam_plp_destructor(bam_plp_t plp, | |||
* Returns BAM_CMATCH, -1 when there is no more cigar to process or the requested position is not covered, | |||
* or -2 on error. | |||
*/ | |||
static inline int cigar_iref2iseq_set(uint32_t **cigar, uint32_t *cigar_max, int *icig, int *iseq, int *iref) | |||
static inline int cigar_iref2iseq_set(uint32_t **cigar, uint32_t *cigar_max, int *icig, int *iseq, hts_pos_t *iref) | |||
{ | |||
int pos = *iref; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64 bit to 32 bit conversion. Is this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it gets away with it because *iref
is set to an offset to the beginning of the read, so should fit within an int
(barring issues with very big BAM_CDEL
or BAM_CREF_SKIP
CIGAR operations). However it might be better to make cigar_iref2iseq_set
, cigar_iref2iseq_next
and tweak_overlap_quality
, which calls them, use hts_pos_t
all round as avoiding any possibility of overflow will make the code easier to understand.
htslib/vcf.h
Outdated
@@ -680,6 +683,7 @@ set to one of BCF_ERR* codes and must be checked before calling bcf_write(). | |||
* Returns 0 on success or negative value on error. | |||
*/ | |||
#define bcf_update_info_int32(hdr,line,key,values,n) bcf_update_info((hdr),(line),(key),(values),(n),BCF_HT_INT) | |||
#define bcf_update_info_int64(hdr,line,key,values,n) bcf_update_info((hdr),(line),(key),(values),(n),BCF_HT_LONG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should probably be a corresponding get macro at line 777:
#define bcf_get_info_int64(hdr,line,tag,dst,ndst) bcf_get_info_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_LONG)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. I also decided to make bcf_update_info_int64()
and bcf_get_info_int64()
static inline functions instead of macros, so we can ensure that they are used with int64_t
arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks fine to me. There are some cases of mixing int64_t with hts_pos_t but unless we move to the latter being int128_t I do not think it will cause a problem.
What we do need to do is update all the copyright dates in changed files. It is a requirement in our software policy.
htslib/vcf.h
Outdated
* @param n: number of values in the array. When set to 0, the INFO tag is removed | ||
* @return 0 on success or negative value on error. | ||
* | ||
* The @p string in bcf_update_info_flag() is optional, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This paragraph can be removed, as this documentation text only refers to bcf_update_info_int64
.
htslib/vcf.h
Outdated
* @param tag: INFO tag to retrieve | ||
* @param dst: *dst is pointer to a memory location, can point to NULL | ||
* @param ndst: pointer to the size of allocated memory | ||
* @return 0 on success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a discrepancy between the return code for success and the paragraph below, which states:
Returns negative value on error or the number of written values (including missing values) on success.
htslib/vcf.h
Outdated
* -2 .. clash between types defined in the header and encountered in the VCF record | ||
* -3 .. tag is not present in the VCF record | ||
* | ||
* Returns negative value on error or the number of written values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
test/sam.c
Outdated
bam1_t_size = 36 + sizeof(int) + 4 + sizeof (char *) + sizeof(uint64_t); | ||
assert(sizeof(hts_pos_t) == 8 || sizeof(hts_pos_t) == 4); | ||
int core_size = sizeof(hts_pos_t) == 8 ? 48 : 36; | ||
bam1_t_size = core_size + sizeof (int) + 4 + sizeof (char *) + sizeof(uint64_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although a test code, this looks a bit ugly. Maybe it can be changed to something like:
bam1_t_size = core_size + sizeof (int) + sizeof (char *) + sizeof (uint64_t) + (sizeof (uint32_t) << 1);
This includes the newly added mempolicy
member.
The in-memory data structures are 64-bit for pos, mate-pos and insert size, along with iterators. The code that fills these out is still all 32-bit so this is basically a place-holder for ABI purposes. The exception to this is SAM support, which being purely textual has the minimal changes necessary to read and write 64-bit values. Split the hts_parse_reg API to 32-bit and 64-bit variants (although 64 bit version is only used internally at the moment). To much code uses this with addresses of 32-bit quantities, so for compatibility hts_parse_reg() cannot change. 64 bit parse_reg uses a slightly tweaked value for the end for chromosomes with no range (eg "chr1"). Using INT64_MAX would yield -1 when cast into int. We now have nearly 64-bit max which when truncated to 32-bit is still INT_MAX. The only change needed in samtools to pass tests is fixing cur5 and pre5 in bam_mate.c.
This upgrades all the internal data types to 64-bit and adds I/O functions for encoding and decoding, but doesn't change the format itself. There is also code with #ifdef LARGE_POS **which should not be used** in production. This is there simply to act as a test for the 64-bit API in htslib iterators. The code is mainly copied from io_lib CRAM4 experimental branch: jkbonfield/io_lib@1150b9c
Added BETA and HUFFMAN support. (Beta can occasionally be used, althoug huffman is mainly for completeness now.) Also fixed the decoder2encoder logic (used by cram_transcode_rg).
Fixed the pileup iterators to internally use 64-bit states. The API for this returns via int *pos, so to keep API consistency we now have new bam_plp64_next, bam_plp64_auto and bam_mplp64_auto functions. Similarly for fai handling fai_fetch64 and faidx_fetch_seq64. Minor tweak to sam_cap_mapq and sam_prob_realn API. Pos parameter is passed by value so doesn't need a new API (promotion is enough), but code hasn't been curated yet. The implementation of these two functions needs more work to be 64-bit clean.
Also fixed missing dependency in bcf_sr_sort.o. CRAM is still using int64_t internally as this is referring to the *potential* on-disk format (with -DLARGE_POS) and none of the changes there are externally visible in the public API anyway. Include stdio.h in hts_defs.h so the mingw __MINGW_PRINTF_FORMAT gets defined in all the places where it's needed.
Make sam_hdr_tid2len() return hts_pos_t. Change length stored in sam_hrec_sq_t to hts_pos_t. Make sam header parser use strtoll() instead of atoi(). Unfortunately changing the size of the header target_len array is difficult as some external software attempts to resize it as a multiple of sizeof(uint32_t). Work around this by storing large values as UINT32_MAX and repurpose the sdict pointer (unused since 7a853e8) as a way of passing the real size through. Code that supports long references will need to use sam_hdr_tid2len() to get the length. Adds tests for reading and writing SAM files.
Swap tid and pos in bam1_core_t. Removes four bytes of padding between tid and pos, and four bytes between mtid and mpos. Reverse order of l_data, data and id in bam1_t. Removes four bytes of padding after l_data on 64-bit platforms. Adjust documentation to match new ordering.
Update reglist functions to work with 64 bit positions. Make bgzf_idx_push take 64 bit positions and increase size of hts_idx_cache_entry::beg and hts_idx_cache_entry::end. Remove restriction on stored positions to INT_MAX in hts_reglist_create() and hts_iter_querys(). hts_reglist_create() can be simplified a bit as it's internally using hts_pair_pos_t to store intervals, which is the same as hts_reglist_t::intervals where they are eventually stored. Old type hts_pair32_t is made a typedef for hts_pair_pos_t as the two structs had become exactly the same. This allows hts_reglist_t::intervals to be changed to type hts_pair_pos_t.
Round-trip SAM -> SAM.gz -> SAM Indexing both on-the-fly and for an existing file. Index look-ups and iterators.
Move duplicated code for calculating n_lvls into its own function. Allow n_lvls to increase in vcf_idx_init for very long references.
Increases size of bcf_idinfo_t::info. This could be used in the future to increase the maximum value for Number= supported in header lines (although the current value is already rather generous).
Needed to support structural variants with END > 2 Gbases. Add BCF_BT_INT64 with the obvious value left clear in the BCF spec. Add BCF_HT_LONG so that it's possible to use int64_t arrays with bcf_get_info_values() and bcf_update_info(). Currently bcf_update_info() only allows a single 64-bit value to be stored. Change bcf_info_t so it can handle a single int64_t value. bcf_info_t::len is also moved to avoid creating a hole. Update vcf_parse() so it can store 64-bit INFO values (again only one is allowed) and use this for END. Add 64 bit value support in bcf_unpack_info_core1() and vcf_format(). It's now possible to round-trip a VCF with large positions, including for structural variants. It's also possible to index them on-the-fly.
The normal way of estimating n_lvls breaks down at about 4 Gbases for the default CSI min_shift. This adds a very simple parser to grab any reference lengths from the headers and find the longest. The value is then used to adjust n_lvls if necessary.
MAX_CSI_COOR is now about 12 Tbp, which is the limit for a CSI index with min_shift 14.
Round trip test, including structural variation with END INFO tag. Indexing an existing file. Index look-up using tabix. Allow test_compare() to avoid differences due to newlines on Windows.
While not strictly necessary (the positions in question are relative to that of read 'b') it makes data types consistent and reduces the possibility of accidental overflow. Also adds a check that the position in the sequence is valid before trying to use it for array look-ups.
As bcf_get_info_int64() and bcf_update_info_int64() are new interfaces, they can be made static inlines instead of macros so that the compiler can check the data type of the values / dst parameter. The old macros probably have to stay as they are as we don't know how they are being used in third-party code. The documentation around the bcf_get_info_ and bcf_update_info_ macros is made a bit more doxygen-like.
0b836c0
to
983244b
Compare
Rebased with documentation fixes, NEWS update and README.large_positions.md file. |
Bumped TWO_TO_THREE_TRANSITION_COUNT due to ABI changes.
In the future we're going to have to tackle this properly, but this PR basically makes this possible without (I hope!) needing another round of ABI breakage. To achieve this, it makes pos, mpos and isize all 64-bit in the external/public APIs.
SAM now works natively in 64-bit. There is place-holder code in CRAM, mainly for testing purposes, taken from io_lib's CRAM4 branch with appropriate ifdefs to block it from normal compilation. We have no roadmap for CRAM4 yet, but if/when it does appear those ifdefs will become version checks instead.
Caveats
l
BAM encoding for signed 64-bit integer, however I don't think I've ever seen it in the wild.hts_readrec_func
prototypes, but these are likely few and far between outside of htslib itself so I deemed it acceptable.Known bugs (will fix and edit this - just spotted it): multi iterator is still 32-bit.