64-bit reference positions - ABI/API placeholder #709

Merged: 23 commits, Sep 26, 2019

Commits on Sep 25, 2019

  1. First stage of supporting large chromosomes.

    The in-memory data structures are 64-bit for pos, mate-pos and insert
    size, along with iterators.  The code that fills these out is still
    all 32-bit so this is basically a place-holder for ABI purposes.
    
    The exception to this is SAM support, which being purely textual has
    the minimal changes necessary to read and write 64-bit values.
    
    Split the hts_parse_reg API into 32-bit and 64-bit variants (although
    the 64-bit version is only used internally at the moment).  Too much
    code uses this with addresses of 32-bit quantities, so for
    compatibility hts_parse_reg() cannot change.
    
    The 64-bit parse_reg uses a slightly tweaked end value for
    chromosomes with no range (e.g. "chr1").  Using INT64_MAX would
    yield -1 when cast to int.  We now use a value just below the 64-bit
    maximum which, when truncated to 32 bits, is still INT_MAX.
    
    The only change needed in samtools to pass tests is fixing cur5
    and pre5 in bam_mate.c.
    jkbonfield authored and daviesrob committed Sep 25, 2019
    418a183
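
The truncation argument in this commit message can be checked with a small sketch. The names `NEAR_POS_MAX` and `truncate_to_int32` are hypothetical, and building the constant by placing INT32_MAX in both 32-bit halves is an assumption about one way to get such a value; the commit message does not state the exact constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant: just below INT64_MAX, with low 32 bits 0x7fffffff. */
#define NEAR_POS_MAX ((((int64_t)INT32_MAX) << 32) | INT32_MAX)

/* Keep only the low 32 bits of v and reinterpret them as a signed value.
 * memcpy avoids relying on implementation-defined narrowing conversions. */
static int32_t truncate_to_int32(int64_t v)
{
    uint32_t low = (uint32_t)((uint64_t)v & 0xffffffffu);
    int32_t out;
    memcpy(&out, &low, sizeof out);
    return out;
}

/* Self-check of the two cases described in the commit message. */
static void check_truncation(void)
{
    assert(truncate_to_int32(INT64_MAX) == -1);            /* the problem case */
    assert(truncate_to_int32(NEAR_POS_MAX) == INT32_MAX);  /* the tweaked value */
}
```

Old callers that store the end value in a 32-bit int therefore still see INT_MAX, while 64-bit callers see a value far beyond any real chromosome length.
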
  2. ABI/API placeholder for 64-bit positions in CRAM.

    This upgrades all the internal data types to 64-bit and adds I/O
    functions for encoding and decoding, but doesn't change the format
    itself.
    
    There is also code with #ifdef LARGE_POS **which should not be used**
    in production.  This is there simply to act as a test for the 64-bit
    API in htslib iterators.
    
    The code is mainly copied from io_lib CRAM4 experimental branch:
    jkbonfield/io_lib@1150b9c
    jkbonfield authored and daviesrob committed Sep 25, 2019
    9e2984a
  3. 00b72b6
  4. a19593a
  5. More cram codec improvements for 64-bit quantities.

    Added BETA and HUFFMAN support. (Beta can occasionally be used,
    although Huffman is mainly for completeness now.)
    
    Also fixed the decoder2encoder logic (used by cram_transcode_rg).
    jkbonfield authored and daviesrob committed Sep 25, 2019
    24436de
  6. Further 64-bit position API/ABI changes.

    Fixed the pileup iterators to internally use 64-bit states.  The API
    for this returns via int *pos, so to keep API consistency we now have
    new bam_plp64_next, bam_plp64_auto and bam_mplp64_auto functions.
    Similarly for fai handling fai_fetch64 and faidx_fetch_seq64.
    
    Minor tweak to sam_cap_mapq and sam_prob_realn API. Pos parameter is
    passed by value so doesn't need a new API (promotion is enough), but
    code hasn't been curated yet.  The implementation of these two
    functions needs more work to be 64-bit clean.
    jkbonfield authored and daviesrob committed Sep 25, 2019
    eb2334e
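
The pattern of keeping an `int *pos` function unchanged while adding a 64-bit sibling might look like the sketch below. Everything here (`toy_iter`, `toy_next64`, `toy_next`) is hypothetical and is not the htslib pileup code; in particular, returning -1 when a position does not fit in an int is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator that walks a fixed list of 64-bit positions. */
typedef struct {
    const int64_t *positions;
    size_t n, i;
} toy_iter;

/* New-style call: returns 1 and stores a 64-bit position, or 0 at the end. */
static int toy_next64(toy_iter *it, int64_t *pos)
{
    assert(it != NULL && pos != NULL);
    if (it->i >= it->n) return 0;
    *pos = it->positions[it->i++];
    return 1;
}

/* Legacy call: keeps its int * signature and forwards to the 64-bit version.
 * Returns -1 (this sketch's choice) if the position does not fit in an int. */
static int toy_next(toy_iter *it, int *pos)
{
    int64_t pos64;
    int ret = toy_next64(it, &pos64);
    if (ret != 1) return ret;
    if (pos64 > INT_MAX || pos64 < INT_MIN) return -1;
    *pos = (int)pos64;
    return 1;
}
```

Existing callers compile and link unchanged, and only code that needs long references has to move to the 64-bit entry point.
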

Commits on Sep 26, 2019

  1. Change from int64_t to hts_pos_t typedef.

    Also fixed missing dependency in bcf_sr_sort.o.
    
    CRAM is still using int64_t internally as this is referring to the
    *potential* on-disk format (with -DLARGE_POS) and none of the changes
    there are externally visible in the public API anyway.
    
    Include stdio.h in hts_defs.h so the mingw __MINGW_PRINTF_FORMAT
    gets defined in all the places where it's needed.
    jkbonfield authored and daviesrob committed Sep 26, 2019
    835e133
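
A position typedef is usually paired with a matching printf format macro so that callers never hard-code the width. The sketch below uses the hypothetical names `my_pos_t`, `PRImy_pos` and `format_region` as stand-ins; it illustrates the idea and is not copied from the htslib headers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical stand-ins for a position type and its printf format macro. */
typedef int64_t my_pos_t;
#define PRImy_pos PRId64

/* Format "chr:beg-end" into buf; returns the snprintf() result. */
static int format_region(char *buf, size_t len, const char *chr,
                         my_pos_t beg, my_pos_t end)
{
    assert(buf != NULL && chr != NULL);
    return snprintf(buf, len, "%s:%" PRImy_pos "-%" PRImy_pos, chr, beg, end);
}
```

If the underlying type ever changes, only the typedef and the macro need updating; call sites such as `format_region(buf, sizeof buf, "chr1", 1, 5000000000LL)` stay the same.
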
  2. Make headers 64-bit compliant

    Make sam_hdr_tid2len() return hts_pos_t.
    
    Change length stored in sam_hrec_sq_t to hts_pos_t.
    
    Make sam header parser use strtoll() instead of atoi().
    
    Unfortunately changing the size of the header target_len array
    is difficult as some external software attempts to resize it
    as a multiple of sizeof(uint32_t).  Work around this by storing
    large values as UINT32_MAX and repurposing the sdict pointer (unused
    since 7a853e8) as a way of passing the real size through.
    Code that supports long references will need to use
    sam_hdr_tid2len() to get the length.
    
    Adds tests for reading and writing SAM files.
    daviesrob committed Sep 26, 2019
    329d2b9
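
The work-around for the 32-bit length array can be pictured as a sentinel plus a side table. The struct and function below (`toy_header`, `toy_tid2len`) are hypothetical and much simpler than the real header; they only show why callers should go through an accessor rather than read the array directly. Returning 0 for an invalid tid is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: a legacy 32-bit length array plus a side table. */
typedef struct {
    int n_targets;
    const uint32_t *target_len;  /* legacy field; UINT32_MAX marks "too long" */
    const int64_t *real_len;     /* full lengths, consulted for the marker */
} toy_header;

/* Accessor in the spirit of sam_hdr_tid2len(): returns the full length,
 * or 0 (in this sketch) for an out-of-range tid. */
static int64_t toy_tid2len(const toy_header *h, int tid)
{
    assert(h != NULL);
    if (tid < 0 || tid >= h->n_targets) return 0;
    if (h->target_len[tid] == UINT32_MAX && h->real_len != NULL)
        return h->real_len[tid];
    return h->target_len[tid];
}
```

Code that reads `target_len` directly keeps working for ordinary references and sees UINT32_MAX for long ones; only the accessor recovers the true value.
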
  3. Eliminate struct holes in bam1_core_t and bam1_t

    Swap tid and pos in bam1_core_t.  Removes four bytes of padding
    between tid and pos, and four bytes between mtid and mpos.
    
    Reverse order of l_data, data and id in bam1_t.  Removes four
    bytes of padding after l_data on 64-bit platforms.
    
    Adjust documentation to match new ordering.
    daviesrob committed Sep 26, 2019
    3108bee
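
The effect of field order on padding can be seen with two simplified, hypothetical layouts. These are not the real bam1_core_t (which has more fields); they keep only the members named in the commit message. The exact amount of padding depends on the platform ABI, so the checks below only compare the two layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit field first: a typical 64-bit ABI pads before each int64_t. */
struct core_before { int32_t tid; int64_t pos; int32_t mtid; int64_t mpos; };

/* 64-bit field first: the two 32-bit fields pack together with no hole. */
struct core_after  { int64_t pos; int32_t tid; int32_t mtid; int64_t mpos; };

/* Padding = struct size minus the sum of the member sizes. */
static size_t padding_before(void)
{
    return sizeof(struct core_before) - (2 * sizeof(int32_t) + 2 * sizeof(int64_t));
}

static size_t padding_after(void)
{
    return sizeof(struct core_after) - (2 * sizeof(int32_t) + 2 * sizeof(int64_t));
}

/* The reordered layout is never larger than the original one. */
static void check_layout(void)
{
    assert(sizeof(struct core_after) <= sizeof(struct core_before));
    assert(offsetof(struct core_after, tid) == sizeof(int64_t));
}
```

On a common 64-bit ABI where int64_t is 8-byte aligned, `padding_before()` would be expected to report two 4-byte holes, matching the figures in the commit message.
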
  4. 3d49c61
  5. Make regions and indexing work with long references

    Update reglist functions to work with 64 bit positions.
    Make bgzf_idx_push take 64 bit positions and increase size
    of hts_idx_cache_entry::beg and hts_idx_cache_entry::end.
    
    Remove restriction on stored positions to INT_MAX in
    hts_reglist_create() and hts_iter_querys().
    
    hts_reglist_create() can be simplified a bit as it's internally
    using hts_pair_pos_t to store intervals, which is the same as
    hts_reglist_t::intervals where they are eventually stored.
    
    Old type hts_pair32_t is made a typedef for hts_pair_pos_t as
    the two structs had become exactly the same.  This allows
    hts_reglist_t::intervals to be changed to type hts_pair_pos_t.
    daviesrob committed Sep 26, 2019
    c557f72
  6. Add large position tests

    Round-trip SAM -> SAM.gz -> SAM
    
    Indexing both on-the-fly and for an existing file.
    
    Index look-ups and iterators.
    daviesrob committed Sep 26, 2019
    674714e
  7. c804f81
  8. Fix n_lvls calculation in vcf_idx_init.

    Move duplicated code for calculating n_lvls into its own function.
    Allow n_lvls to increase in vcf_idx_init for very long references.
    daviesrob committed Sep 26, 2019
    2e9bfe3
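
A level calculation of this kind can be sketched as follows, assuming the usual CSI layout in which the smallest bin spans 2^min_shift bases and each extra level multiplies the covered range by 8. The function name `calc_n_lvls` and its exact stopping rule are hypothetical; the real helper may add a safety margin or handle missing lengths differently.

```c
#include <assert.h>
#include <stdint.h>

/* Grow n_lvls from starting_n_lvls until a range of
 * 2^(min_shift + 3 * n_lvls) bases covers max_len. */
static int calc_n_lvls(int64_t max_len, int min_shift, int starting_n_lvls)
{
    assert(max_len >= 0 && min_shift >= 0 && starting_n_lvls >= 0);
    int n_lvls = starting_n_lvls;
    int shift = min_shift + 3 * n_lvls;
    /* Stop before the shift could overflow a signed 64-bit value. */
    while (shift < 62 && max_len > ((int64_t)1 << shift)) {
        n_lvls++;
        shift += 3;
    }
    return n_lvls;
}
```

With min_shift 14 and a starting value of 5, references up to 2^29 bases keep 5 levels, and each further factor of 8 in length adds one level.
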
  9. Store > 32 bit reference lengths in bcf_idinfo_t contig data

    Increases size of bcf_idinfo_t::info.  This could be used in the
    future to increase the maximum value for Number= supported
    in header lines (although the current value is already rather
    generous).
    daviesrob committed Sep 26, 2019
    10709d6
  10. Allow storage of 64 bit INFO values for vcf files

    Needed to support structural variants with END > 2 Gbases.
    
    Add BCF_BT_INT64 with the obvious value left clear in the BCF spec.
    
    Add BCF_HT_LONG so that it's possible to use int64_t arrays with
    bcf_get_info_values() and bcf_update_info().  Currently
    bcf_update_info() only allows a single 64-bit value to be stored.
    
    Change bcf_info_t so it can handle a single int64_t value.
    bcf_info_t::len is also moved to avoid creating a hole.
    
    Update vcf_parse() so it can store 64-bit INFO values (again only
    one is allowed) and use this for END.
    
    Add 64 bit value support in bcf_unpack_info_core1() and
    vcf_format().
    
    It's now possible to round-trip a VCF with large positions,
    including for structural variants.  It's also possible to
    index them on-the-fly.
    daviesrob committed Sep 26, 2019
    1606913
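
Choosing the narrowest integer type for a value such as END might be sketched like this. The `TOY_BT_*` codes mirror the BCF integer type codes 1 to 3, with 4 assumed for the 64-bit type because it is the unused slot before the float code; the functions are hypothetical. The sketch also assumes that the lowest eight values of each width are reserved (for missing and end-of-vector markers), so they are excluded from each range.

```c
#include <assert.h>
#include <stdint.h>

/* Toy type codes; 4 for the 64-bit type is an assumption of this sketch. */
enum { TOY_BT_INT8 = 1, TOY_BT_INT16 = 2, TOY_BT_INT32 = 3, TOY_BT_INT64 = 4 };

/* Pick the narrowest type whose usable range holds v. The lowest eight
 * values of each width are treated as reserved. */
static int smallest_int_type(int64_t v)
{
    if (v >= INT8_MIN + 8 && v <= INT8_MAX) return TOY_BT_INT8;
    if (v >= INT16_MIN + 8 && v <= INT16_MAX) return TOY_BT_INT16;
    if (v >= (int64_t)INT32_MIN + 8 && v <= INT32_MAX) return TOY_BT_INT32;
    return TOY_BT_INT64;
}

/* Payload bytes needed for one value of the given type code: 1, 2, 4 or 8. */
static int type_size(int type)
{
    assert(type >= TOY_BT_INT8 && type <= TOY_BT_INT64);
    return 1 << (type - 1);
}
```

An END of 3000000000 falls outside the 32-bit range, so it selects the 64-bit code, while ordinary positions keep their existing compact encodings.
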
  11. Attempt to make tabix work for VCF and SAM with long references

    The normal way of estimating n_lvls breaks down at about
    4 Gbases for the default CSI min_shift.  This adds a very simple
    parser to grab any reference lengths from the headers and find
    the longest.  The value is then used to adjust n_lvls if necessary.
    daviesrob committed Sep 26, 2019
    9ed0641
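
A very simple header scan of the kind described could look like the sketch below. It is a hypothetical illustration, not the tabix code: it assumes SAM lengths appear as `LN:` fields on `@SQ` lines and VCF lengths as `length=` inside `##contig=<...>` lines, and it ignores anything it cannot parse.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return the number following `key` on the line [line, end), or 0 if absent. */
static int64_t value_after(const char *line, const char *end, const char *key)
{
    size_t klen = strlen(key);
    for (const char *p = line; p + klen <= end; p++) {
        if (memcmp(p, key, klen) != 0) continue;
        /* Require a digit so strtoll() cannot skip ahead into the next line. */
        if (p + klen < end && isdigit((unsigned char)p[klen]))
            return (int64_t)strtoll(p + klen, NULL, 10);
        return 0;
    }
    return 0;
}

/* Scan a NUL-terminated header and return the longest reference length. */
static int64_t longest_reference(const char *hdr)
{
    assert(hdr != NULL);
    int64_t max_len = 0;
    while (*hdr) {
        const char *end = strchr(hdr, '\n');
        if (!end) end = hdr + strlen(hdr);
        int64_t len = 0;
        if (strncmp(hdr, "@SQ\t", 4) == 0)
            len = value_after(hdr, end, "\tLN:");
        else if (strncmp(hdr, "##contig=<", 10) == 0)
            len = value_after(hdr, end, "length=");
        if (len > max_len) max_len = len;
        hdr = *end ? end + 1 : end;   /* step past the newline, if any */
    }
    return max_len;
}
```

The result can then be used to decide whether the default number of index levels is enough for the longest reference in the file.
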
  12. Make synced_bcf_reader work with 64 bit positions.

    MAX_CSI_COOR is now about 12 Tbp, which is the limit for
    a CSI index with min_shift 14.
    daviesrob committed Sep 26, 2019
    0f0d091
  13. 9c943da
  14. Add VCF long reference tests.

    Round trip test, including structural variation with END INFO tag.
    
    Indexing an existing file.
    
    Index look-up using tabix.
    
    Allow test_compare() to avoid differences due to newlines on
    Windows.
    daviesrob committed Sep 26, 2019
    4f1a3fc
  15. Use hts_pos_t in tweak_overlap_quality() and related functions

    While not strictly necessary (the positions in question are
    relative to that of read 'b') it makes data types consistent
    and reduces the possibility of accidental overflow.
    
    Also adds a check that the position in the sequence is valid
    before trying to use it for array look-ups.
    daviesrob committed Sep 26, 2019
    995069d
  16. Add bcf_get_info_int64; make bcf_update_info_int64 a static inline

    As bcf_get_info_int64() and bcf_update_info_int64() are new
    interfaces, they can be made static inlines instead of macros so
    that the compiler can check the data type of the values / dst
    parameter.  The old macros probably have to stay as they
    are as we don't know how they are being used in third-party code.
    
    The documentation around the bcf_get_info_ and bcf_update_info_
    macros is made a bit more doxygen-like.
    daviesrob committed Sep 26, 2019
    f84bba1
  17. 983244b