Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

64-bit reference positions - ABI/API placeholder #709

Merged
merged 23 commits into from
Sep 26, 2019

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Jun 4, 2018

In the future we're going to have to tackle this properly, but this PR basically makes this possible without (I hope!) needing another round of ABI breakage. To achieve this, it makes pos, mpos and isize all 64-bit in the external/public APIs.

SAM now works natively in 64-bit. There is place-holder code in CRAM, mainly for testing purposes, taken from io_lib's CRAM4 branch with appropriate ifdefs to block it from normal compilation. We have no roadmap for CRAM4 yet, but if/when it does appear those ifdefs will become version checks instead.

Caveats

  • The SAM CP tag is a non-starter unless we add e.g. l BAM encoding for signed 64-bit integer, however I don't think I've ever seen it in the wild.
  • We may want to duplicate some functions and have 64-bit named variants, to offer better migration. I did this with hts_parse_reg as without it so much third-party code dies in horrible ways. Arguably we could do the same with hts_readrec_func prototypes, but these are likely few and far between outside of htslib itself so I deemed it acceptable.
  • Structure layout probably isn't efficient now as there may be padding introduced. We can juggle that or just put in explicit 32-bit reserved ints ready for future expansion. (We have some ideas on that anyway.)
  • Hts_parse_reg64 needs merging with the hts_parse_reg2 from Work around colons in reference names (GRCh38 HLA). #708.
  • Making hts_pair32_t a pair of 64-bit ints is appalling! However it was vastly less API change this way! We could bite the bullet and change more API, but note the internal fields are named differently between hts_pair32_t and hts_pair64_t (why?).
  • Samtools sort and bedidx code has its own set of issues, as they shift 32 bit positions into a 64 bit int combined with other parameters.

Known bugs (will fix and edit this - just spotted it): multi iterator is still 32-bit.

@jmarshall
Copy link
Member

int is the totally generic integer type, while int64_t is rather less so.

Would it be worth introducing a typedef — hts_pos_t perhaps — and using it in the API function declarations etc? This would emphasise that these parameters are positions rather than emphasising an implementation detail.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Jun 5, 2018

I was thinking that while writing this infact, but sadly after I was half way through and lacked will power to go back and change it all! I do think it's probably the right thing to do though, with associated PRIhts_pos macro for printing.

@daviesrob daviesrob force-pushed the large_pos branch 4 times, most recently from dc5449d to 0fc857f Compare August 13, 2018 15:11
@daviesrob
Copy link
Member

TODO:

  • Ensure new header API can handle long references.
  • Aux type l for int64_t values. (Don't bother with unsigned)

@daviesrob
Copy link
Member

Rebased with additions to make headers work.

Notes:

  • Code using HTSlib will likely need minor changes, even if it only supports 32 bit positions, after this is merged:
    • Anything that mixes signed and unsigned types like this naughtiness in Samtools will give the wrong answer due to missing sign extension.
    • printf formats will need to be changed where the item being printed is now 64 bits.
  • As upstream code may try to reallocate the target_len array (for example here in samtools), trying to change its size is likely to cause much breakage. A work-around has been put in place where long refs are stored as UINT_MAX in target_len and the real length is stored elsewhere. Code that wants to support long references will have to use sam_hdr_tid2len() to get the reference length.

We need to test against some htslib-using packages to check that they still work correctly when built against this branch. I've tried samtools and bcftools, which work bar the items noted above. I've also tried pysam, which manages fairly well (it builds OK, although a few tests fail due to unrelated issues with illegal headers in a couple of test files).

// also INT_MAX instead of -1. This avoids bugs with old code
// using the new hts_pos_t data type.
#define HTS_POS_MAX ((((int64_t)INT_MAX)<<32)|INT_MAX)
#define HTS_POS_MIN INT64_MIN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the same issue with INT64_MIN being reduced to 0 when cast into a 32-bit int?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think HTS_POS_MIN is there for completeness. It isn't actually returned by any functions in HTSlib, so you would only use it in code upgraded to work with type hts_pos_t. It's also only meaningful for fields like isize that could legitimately take a negative value.

if (hrecs->ref[i].len < UINT32_MAX) {
bh->target_len[i] = hrecs->ref[i].len;
} else {
bh->target_len[i] = UINT32_MAX;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't sdict be updated at this point?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't need to because sam_hdr_tid2len() uses the sam_hrecs_t::ref array if available. So if target_len is being updated from sam_hrecs_t::ref, sdict isn't going to be used any more and arguably it could actually be freed at this point.

htslib/hts.h Outdated
typedef struct {
uint32_t beg, end;
//uint32_t beg, end;
hts_pos_t beg, end; // sorry for the bad naming: FIXME!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hts_pair32_t could be deprecated and its few uses replaced with hts_pair_pos_t.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly yes, although it's exposed in hts_reglist_t::intervals and I see that samtools allocates one in bed_reglist().

Possibly we should make hts_pair32_t a typedef of hts_pair_pos_t.

@@ -1290,6 +1306,8 @@ typedef struct __bam_mplp_t *bam_mplp_t;
int bam_plp_push(bam_plp_t iter, const bam1_t *b);
const bam_pileup1_t *bam_plp_next(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
const bam_pileup1_t *bam_plp_auto(bam_plp_t iter, int *_tid, int *_pos, int *_n_plp);
const bam_pileup1_t *bam_plp64_next(bam_plp_t iter, int *_tid, hts_pos_t *_pos, int *_n_plp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't bam_pileup1_t.qpos (line 1283) be changed from int32_t to hts_pos_t ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qpos is on the read, not the reference. We're not attempting to support reads longer than 2Gbases at the moment.

@daviesrob
Copy link
Member

Rebased with more additions:

  • hts_reglist_t and associated funtions support 64 bit positions

  • 64 bit position support in bgzf_idx_push()

  • Read / write bcf1_t::pos as 64 bit for VCF.

  • Support VCF headers with contig lengths > 2Gbp.

  • Minimal support for 64 bit INFO tags in VCF files. This is needed for 64 bit END tags in structural variants. This adds BCF_BT_INT64 so we can store int64_t INFO tags, and BCF_HT_LONG so wa can pass int64_t arrays to / from bcf_get_info_values() and bcf_update_info().

  • Various updates to make VCF indexes and tabix work on files with large positions.

  • Updates to synced_bcf_reader for 64 bit positions.

  • Updates to regidx for 64 bit positions.

  • Round-trip and indexing tests for SAM and VCF files with large position values.

Now is the time to seriously kick the tyres on this one, especially the VCF updates. It's looking ready to be merged, so if anyone has comments it would be best to add them sooner rather than later.

See also samtools/samtools#1107 and samtools/bcftools#1083 which need to be merged before this PR so that they continue to work properly.

sam.c Outdated
@@ -3960,7 +4018,7 @@ void bam_plp_destructor(bam_plp_t plp,
* Returns BAM_CMATCH, -1 when there is no more cigar to process or the requested position is not covered,
* or -2 on error.
*/
static inline int cigar_iref2iseq_set(uint32_t **cigar, uint32_t *cigar_max, int *icig, int *iseq, int *iref)
static inline int cigar_iref2iseq_set(uint32_t **cigar, uint32_t *cigar_max, int *icig, int *iseq, hts_pos_t *iref)
{
int pos = *iref;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 bit to 32 bit conversion. Is this right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it gets away with it because *iref is set to an offset to the beginning of the read, so should fit within an int (barring issues with very big BAM_CDEL or BAM_CREF_SKIP CIGAR operations). However it might be better to make cigar_iref2iseq_set, cigar_iref2iseq_next and tweak_overlap_quality, which calls them, use hts_pos_t all round as avoiding any possibility of overflow will make the code easier to understand.

htslib/vcf.h Outdated
@@ -680,6 +683,7 @@ set to one of BCF_ERR* codes and must be checked before calling bcf_write().
* Returns 0 on success or negative value on error.
*/
#define bcf_update_info_int32(hdr,line,key,values,n) bcf_update_info((hdr),(line),(key),(values),(n),BCF_HT_INT)
#define bcf_update_info_int64(hdr,line,key,values,n) bcf_update_info((hdr),(line),(key),(values),(n),BCF_HT_LONG)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should probably be a corresponding get macro at line 777:

#define bcf_get_info_int64(hdr,line,tag,dst,ndst)  bcf_get_info_values(hdr,line,tag,(void**)(dst),ndst,BCF_HT_LONG)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. I also decided to make bcf_update_info_int64() and bcf_get_info_int64() static inline functions instead of macros, so we can ensure that they are used with int64_t arrays.

Copy link
Contributor

@whitwham whitwham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks fine to me. There are some cases of mixing int64_t with hts_pos_t but unless we move to the latter being int128_t I do not think it will cause a problem.

What we do need to do is update all the copyright dates in changed files. It is a requirement in our software policy.

htslib/vcf.h Outdated
* @param n: number of values in the array. When set to 0, the INFO tag is removed
* @return 0 on success or negative value on error.
*
* The @p string in bcf_update_info_flag() is optional,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph can be removed, as this documentation text only refers to bcf_update_info_int64.

htslib/vcf.h Outdated
* @param tag: INFO tag to retrieve
* @param dst: *dst is pointer to a memory location, can point to NULL
* @param ndst: pointer to the size of allocated memory
* @return 0 on success
Copy link
Contributor

@valeriuo valeriuo Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a discrepancy between the return code for success and the paragraph below, which states:
Returns negative value on error or the number of written values (including missing values) on success.

htslib/vcf.h Outdated
* -2 .. clash between types defined in the header and encountered in the VCF record
* -3 .. tag is not present in the VCF record
*
* Returns negative value on error or the number of written values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

test/sam.c Outdated
bam1_t_size = 36 + sizeof(int) + 4 + sizeof (char *) + sizeof(uint64_t);
assert(sizeof(hts_pos_t) == 8 || sizeof(hts_pos_t) == 4);
int core_size = sizeof(hts_pos_t) == 8 ? 48 : 36;
bam1_t_size = core_size + sizeof (int) + 4 + sizeof (char *) + sizeof(uint64_t);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although a test code, this looks a bit ugly. Maybe it can be changed to something like:

bam1_t_size = core_size + sizeof (int) + sizeof (char *) + sizeof (uint64_t) + (sizeof (uint32_t) << 1);

This includes the newly added mempolicy member.

jkbonfield and others added 10 commits September 25, 2019 11:42
The in-memory data structures are 64-bit for pos, mate-pos and insert
size, along with iterators.  The code that fills these out is still
all 32-bit so this is basically a place-holder for ABI purposes.

The exception to this is SAM support, which being purely textual has
the minimal changes necessary to read and write 64-bit values.

Split the hts_parse_reg API to 32-bit and 64-bit variants (although
64 bit version is only used internally at the moment).  To much
code uses this with addresses of 32-bit quantities, so for
compatibility hts_parse_reg() cannot change.

64 bit parse_reg uses a slightly tweaked value for the end for
chromosomes with no range (eg "chr1").  Using INT64_MAX would
yield -1 when cast into int.  We now have nearly 64-bit max which
when truncated to 32-bit is still INT_MAX.

The only change needed in samtools to pass tests is fixing cur5
and pre5 in bam_mate.c.
This upgrades all the internal data types to 64-bit and adds I/O
functions for encoding and decoding, but doesn't change the format
itself.

There is also code with #ifdef LARGE_POS **which should not be used**
in production.  This is there simply to act as a test for the 64-bit
API in htslib iterators.

The code is mainly copied from io_lib CRAM4 experimental branch:
jkbonfield/io_lib@1150b9c
Added BETA and HUFFMAN support. (Beta can occasionally be used,
althoug huffman is mainly for completeness now.)

Also fixed the decoder2encoder logic (used by cram_transcode_rg).
Fixed the pileup iterators to internally use 64-bit states.  The API
for this returns via int *pos, so to keep API consistency we now have
new bam_plp64_next, bam_plp64_auto and bam_mplp64_auto functions.
Similarly for fai handling fai_fetch64 and faidx_fetch_seq64.

Minor tweak to sam_cap_mapq and sam_prob_realn API. Pos parameter is
passed by value so doesn't need a new API (promotion is enough), but
code hasn't been curated yet.  The implementation of these two
functions needs more work to be 64-bit clean.
Also fixed missing dependency in bcf_sr_sort.o.

CRAM is still using int64_t internally as this is referring to the
*potential* on-disk format (with -DLARGE_POS) and none of the changes
there are externally visible in the public API anyway.

Include stdio.h in hts_defs.h so the mingw __MINGW_PRINTF_FORMAT
gets defined in all the places where it's needed.
Make sam_hdr_tid2len() return hts_pos_t.

Change length stored in sam_hrec_sq_t to hts_pos_t.

Make sam header parser use strtoll() instead of atoi().

Unfortunately changing the size of the header target_len array
is difficult as some external software attempts to resize it
as a multiple of sizeof(uint32_t).  Work around this by storing
large values as UINT32_MAX and repurpose the sdict pointer (unused
since 7a853e8) as a way of passing the real size through.
Code that supports long references will need to use
sam_hdr_tid2len() to get the length.

Adds tests for reading and writing SAM files.
Swap tid and pos in bam1_core_t.  Removes four bytes of padding
between tid and pos, and four bytes between mtid and mpos.

Reverse order of l_data, data and id in bam1_t.  Removes four
bytes of padding after l_data on 64-bit platforms.

Adjust documentation to match new ordering.
Update reglist functions to work with 64 bit positions.
Make bgzf_idx_push take 64 bit positions and increase size
of hts_idx_cache_entry::beg and hts_idx_cache_entry::end.

Remove restriction on stored positions to INT_MAX in
hts_reglist_create() and hts_iter_querys().

hts_reglist_create() can be simplified a bit as it's internally
using hts_pair_pos_t to store intervals, which is the same as
hts_reglist_t::intervals where they are eventually stored.

Old type hts_pair32_t is made a typedef for hts_pair_pos_t as
the two structs had become exactly the same.  This allows
hts_reglist_t::intervals to be changed to type hts_pair_pos_t.
Round-trip SAM -> SAM.gz -> SAM

Indexing both on-the-fly and for an existing file.

Index look-ups and iterators.
Move duplicated code for calculating n_lvls into its own function.
Allow n_lvls to increase in vcf_idx_init for very long references.
Increases size of bcf_idinfo_t::info.  This could be used in the
future to increase the maximum value for Number= supported
in header lines (although the current value is already rather
generous).
Needed to support structural variants with END > 2 Gbases.

Add BCF_BT_INT64 with the obvious value left clear in the BCF spec.

Add BCF_HT_LONG so that it's possible to use int64_t arrays with
bcf_get_info_values() and bcf_update_info().  Currently
bcf_update_info() only allows a single 64-bit value to be stored.

Change bcf_info_t so it can handle a single int64_t value.
bcf_info_t::len is also moved to avoid creating a hole.

Update vcf_parse() so it can store 64-bit INFO values (again only
one is allowed) and use this for END.

Add 64 bit value support in bcf_unpack_info_core1() and
vcf_format().

It's now possible to round-trip a VCF with large positions,
including for structural variants.  It's also possible to
index them on-the-fly.
The normal way of estimating n_lvls breaks down at about
4 Gbases for the default CSI min_shift.  This adds a very simple
parser to grab any reference lengths from the headers and find
the longest.  The value is then used to adjust n_lvls if necessary.
MAX_CSI_COOR is now about 12 Tbp, which is the limit for
a CSI index with min_shift 14.
Round trip test, including structural variation with END INFO tag.

Indexing an existing file.

Index look-up using tabix.

Allow test_compare() to avoid differences due to newlines on
Windows.
While not strictly necessary (the positions in question are
relative to that of read 'b') it makes data types consistent
and reduces the possibility of accidental overflow.

Also adds a check that the position in the sequence is valid
before trying to use it for array look-ups.
As bcf_get_info_int64() and bcf_update_info_int64() are new
interfaces, they can be made static inlines instead of macros so
that the compiler can check the data type of the values / dst
parameter.  The old macros probably have to stay as they
are as we don't know how they are being used in third-party code.

The documentation around the bcf_get_info_ and bcf_update_info_
macros is made a bit more doxygen-like.
@daviesrob
Copy link
Member

Rebased with documentation fixes, NEWS update and README.large_positions.md file.

daviesrob added a commit to daviesrob/htslib that referenced this pull request Sep 26, 2019
Bumped TWO_TO_THREE_TRANSITION_COUNT due to ABI changes.
@daviesrob daviesrob merged commit 983244b into samtools:develop Sep 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants