64-bit reference positions - ABI/API placeholder #709
Commits on Sep 25, 2019
First stage of supporting large chromosomes.
The in-memory data structures are 64-bit for pos, mate-pos and insert size, along with the iterators. The code that fills these out is still all 32-bit, so this is basically a placeholder for ABI purposes. The exception is SAM support, which, being purely textual, has the minimal changes necessary to read and write 64-bit values.
Split the hts_parse_reg API into 32-bit and 64-bit variants (although the 64-bit version is only used internally at the moment). Too much code uses this with addresses of 32-bit quantities, so for compatibility hts_parse_reg() cannot change.
The 64-bit parse_reg uses a slightly tweaked end value for chromosomes with no range (e.g. "chr1"). Using INT64_MAX would yield -1 when cast to int. We now have a nearly 64-bit maximum which, when truncated to 32 bits, is still INT_MAX.
The only change needed in samtools to pass tests is fixing cur5 and pre5 in bam_mate.c.
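The truncation argument in this commit message can be checked with a few lines of C. This is a minimal sketch rather than htslib's code: `REGION_END_MAX` and `low32()` are illustrative names, the constant is just one value whose low 32 bits equal INT_MAX, and a 32-bit `int` is assumed.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical sentinel: INT_MAX in both the high and the low 32-bit half.
 * Assumes int is 32 bits wide. */
#define REGION_END_MAX ((((int64_t)INT_MAX) << 32) | INT_MAX)

/* Keep only the low 32 bits, as happens when a 64-bit end value is
 * stored into a 32-bit slot. Conversion to an unsigned type is well
 * defined in C (reduction modulo 2^32). */
static uint32_t low32(int64_t v)
{
    return (uint32_t)v;
}
```

Under these assumptions, `low32(INT64_MAX)` is 0xFFFFFFFF, the bit pattern of -1 in a 32-bit int, whereas `low32(REGION_END_MAX)` is 0x7FFFFFFF, which is INT_MAX.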
Commit 418a183
ABI/API placeholder for 64-bit positions in CRAM.
This upgrades all the internal data types to 64-bit and adds I/O functions for encoding and decoding, but doesn't change the format itself. There is also code with #ifdef LARGE_POS **which should not be used** in production. This is there simply to act as a test for the 64-bit API in htslib iterators. The code is mainly copied from io_lib CRAM4 experimental branch: jkbonfield/io_lib@1150b9c
Commit 9e2984a
Commit 00b72b6
Commit a19593a
More cram codec improvements for 64-bit quantities.
Added BETA and HUFFMAN support. (Beta can occasionally be used, although Huffman is mainly for completeness now.) Also fixed the decoder2encoder logic (used by cram_transcode_rg).
Commit 24436de
Further 64-bit position API/ABI changes.
Fixed the pileup iterators to use 64-bit state internally. The existing API returns the position via int *pos, so to keep API consistency we now have new bam_plp64_next, bam_plp64_auto and bam_mplp64_auto functions. Similarly, fai handling gains fai_fetch64 and faidx_fetch_seq64. Minor tweak to the sam_cap_mapq and sam_prob_realn APIs: the pos parameter is passed by value, so it doesn't need a new API (promotion is enough), but the code hasn't been curated yet. The implementation of these two functions needs more work to be 64-bit clean.
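The pattern of adding a 64-bit function next to an unchanged `int *pos` function can be sketched as below. All names here (`pos64_t`, `next64`, `next32`) are hypothetical, and the behaviour on overflow (report INT_MAX and fail) is an assumption made for the example, not a statement of what htslib's wrappers do.

```c
#include <limits.h>
#include <stdint.h>

typedef int64_t pos64_t;  /* stand-in for a 64-bit position type */

/* Hypothetical 64-bit iterator step: reports the current position
 * through *pos and advances the state. Always succeeds (returns 1). */
static int next64(pos64_t *state, pos64_t *pos)
{
    *pos = (*state)++;
    return 1;
}

/* Legacy-style wrapper that keeps the int *pos signature.
 * If the position does not fit in an int it stores INT_MAX and
 * returns 0, so old callers never see a silently truncated value. */
static int next32(pos64_t *state, int *pos)
{
    pos64_t p = 0;
    int ret = next64(state, &p);
    if (p > INT_MAX) {
        *pos = INT_MAX;
        return 0;
    }
    *pos = (int)p;
    return ret;
}
```

Existing callers keep compiling against `next32()`, while code that needs long references switches to `next64()`.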
Commit eb2334e
Commits on Sep 26, 2019
Change from int64_t to hts_pos_t typedef.
Also fixed missing dependency in bcf_sr_sort.o. CRAM is still using int64_t internally as this is referring to the *potential* on-disk format (with -DLARGE_POS) and none of the changes there are externally visible in the public API anyway. Include stdio.h in hts_defs.h so the mingw __MINGW_PRINTF_FORMAT gets defined in all the places where it's needed.
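One benefit of a dedicated position typedef is that the width and the matching printf format can be changed in one place. The sketch below illustrates the idea with hypothetical names (`my_pos_t`, `PRI_MY_POS`, `format_pos`); it is not taken from the htslib headers.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical position type and its matching printf conversion. */
typedef int64_t my_pos_t;
#define PRI_MY_POS PRId64

/* Format a position as decimal text; returns snprintf's result
 * (the number of characters that the full output requires). */
static int format_pos(char *buf, size_t len, my_pos_t pos)
{
    return snprintf(buf, len, "%" PRI_MY_POS, pos);
}
```

Callers never hard-code `%d` or `%ld`, so the code stays correct on platforms where `long` is only 32 bits.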
Commit 835e133
Make sam_hdr_tid2len() return hts_pos_t. Change the length stored in sam_hrec_sq_t to hts_pos_t. Make the SAM header parser use strtoll() instead of atoi(). Unfortunately, changing the size of the header target_len array is difficult, as some external software attempts to resize it as a multiple of sizeof(uint32_t). Work around this by storing large values as UINT32_MAX and repurposing the sdict pointer (unused since 7a853e8) as a way of passing the real size through. Code that supports long references will need to use sam_hdr_tid2len() to get the length. Adds tests for reading and writing SAM files.
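The two ideas in this commit message, parsing lengths with strtoll() and saturating the legacy 32-bit slot, can be sketched as follows. `parse_len()` and `legacy_len()` are hypothetical helpers written for illustration, and the error handling is an assumption; the side channel that carries the real length is not modelled here.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal reference length with strtoll(), which can hold
 * values beyond the range of int (unlike atoi()).
 * Returns -1 for malformed, out-of-range or non-positive input. */
static int64_t parse_len(const char *s)
{
    char *end;
    errno = 0;
    long long v = strtoll(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v <= 0)
        return -1;
    return (int64_t)v;
}

/* Value to place in a legacy uint32_t length slot: lengths that do
 * not fit are stored as UINT32_MAX, signalling that the real length
 * must be obtained elsewhere. */
static uint32_t legacy_len(int64_t real_len)
{
    return real_len >= UINT32_MAX ? UINT32_MAX : (uint32_t)real_len;
}
```

In this sketch a caller that reads UINT32_MAX from the legacy slot knows the value may be saturated; per the commit message, htslib callers obtain the real length through sam_hdr_tid2len().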
Commit 329d2b9
Eliminate struct holes in bam1_core_t and bam1_t
Swap tid and pos in bam1_core_t. Removes four bytes of padding between tid and pos, and four bytes between mtid and mpos. Reverse order of l_data, data and id in bam1_t. Removes four bytes of padding after l_data on 64-bit platforms. Adjust documentation to match new ordering.
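The effect of member ordering on padding can be seen with two simplified structs. These are illustrative stand-ins with only four members each, not the real bam1_core_t layout, and the exact sizes depend on the platform's alignment rules.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit member before each 64-bit member: where int64_t needs
 * 8-byte alignment, the compiler inserts 4 bytes of padding after
 * tid and again after mtid. */
struct core_before {
    int32_t tid;
    int64_t pos;
    int32_t mtid;
    int64_t mpos;
};

/* Reordered so the two 32-bit members sit next to each other and
 * together fill one 8-byte slot, leaving no holes. */
struct core_after {
    int64_t pos;
    int32_t tid;
    int32_t mtid;
    int64_t mpos;
};
```

On common 64-bit ABIs `sizeof(struct core_before)` is typically 32 and `sizeof(struct core_after)` is 24; comparing `sizeof` and `offsetof` values like this is a quick way to confirm that a reordering actually removed the padding.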
Commit 3108bee
Commit 3d49c61
Make regions and indexing work with long references
Update reglist functions to work with 64 bit positions. Make bgzf_idx_push take 64 bit positions and increase size of hts_idx_cache_entry::beg and hts_idx_cache_entry::end. Remove restriction on stored positions to INT_MAX in hts_reglist_create() and hts_iter_querys(). hts_reglist_create() can be simplified a bit as it's internally using hts_pair_pos_t to store intervals, which is the same as hts_reglist_t::intervals where they are eventually stored. Old type hts_pair32_t is made a typedef for hts_pair_pos_t as the two structs had become exactly the same. This allows hts_reglist_t::intervals to be changed to type hts_pair_pos_t.
Commit c557f72
Round-trip SAM -> SAM.gz -> SAM. Indexing both on-the-fly and for an existing file. Index look-ups and iterators.
Commit 674714e
Commit c804f81
Fix n_lvls calculation in vcf_idx_init.
Move duplicated code for calculating n_lvls into its own function. Allow n_lvls to increase in vcf_idx_init for very long references.
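For context, a binning index with a given min_shift and n_lvls covers 2^(min_shift + 3 * n_lvls) bases, because each extra level multiplies the covered span by 8. A minimal sketch of choosing n_lvls for a given longest reference is shown below; `calc_n_lvls()` is a hypothetical helper written for this example and is not the function added by the commit.

```c
#include <stdint.h>

/* Smallest number of levels such that 2^(min_shift + 3 * n_lvls)
 * is at least max_len. The second loop condition stops before the
 * shift could overflow int64_t. */
static int calc_n_lvls(int64_t max_len, int min_shift)
{
    int n_lvls = 0;
    int64_t span = (int64_t)1 << min_shift;
    while (span < max_len && span <= INT64_MAX / 8) {
        span <<= 3;   /* one more level: 8 times the coverage */
        n_lvls++;
    }
    return n_lvls;
}
```

With min_shift 14, a 2^29-base reference needs 5 levels, a 2^32-base reference needs 6, and anything longer than 2^32 needs at least 7.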
Commit 2e9bfe3
Store > 32 bit reference lengths in bcf_idinfo_t contig data
Increases size of bcf_idinfo_t::info. This could be used in the future to increase the maximum value for Number= supported in header lines (although the current value is already rather generous).
Commit 10709d6
Allow storage of 64 bit INFO values for vcf files
Needed to support structural variants with END > 2 Gbases. Add BCF_BT_INT64 with the obvious value left clear in the BCF spec. Add BCF_HT_LONG so that it's possible to use int64_t arrays with bcf_get_info_values() and bcf_update_info(). Currently bcf_update_info() only allows a single 64-bit value to be stored. Change bcf_info_t so it can handle a single int64_t value. bcf_info_t::len is also moved to avoid creating a hole. Update vcf_parse() so it can store 64-bit INFO values (again only one is allowed) and use this for END. Add 64 bit value support in bcf_unpack_info_core1() and vcf_format(). It's now possible to round-trip a VCF with large positions, including for structural variants. It's also possible to index them on-the-fly.
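To illustrate why an extra integer type is needed, the sketch below picks the narrowest typed-integer code able to hold a value. The enum names are hypothetical; the codes 1 to 3 follow the BCF typed-value encoding, 4 is assumed to be the "obvious value" the commit refers to, and the assumption that the eight lowest values of each width are reserved (for missing and end-of-vector markers) should be checked against the BCF specification.

```c
#include <stdint.h>

/* Hypothetical type codes; 4 for 64-bit integers is an assumption. */
enum { BT_INT8 = 1, BT_INT16 = 2, BT_INT32 = 3, BT_INT64 = 4 };

/* Return the narrowest integer type code that can store v, treating
 * the eight lowest values of each width as reserved (assumption). */
static int smallest_int_type(int64_t v)
{
    if (v >= INT8_MIN + 8 && v <= INT8_MAX)
        return BT_INT8;
    if (v >= INT16_MIN + 8 && v <= INT16_MAX)
        return BT_INT16;
    if (v >= (int64_t)INT32_MIN + 8 && v <= INT32_MAX)
        return BT_INT32;
    return BT_INT64;   /* e.g. END beyond 2 Gbases */
}
```

An END value of 3,000,000,000 falls through every 32-bit-or-narrower case, which is the situation the new 64-bit type is meant to cover.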
Commit 1606913
Attempt to make tabix work for VCF and SAM with long references
The normal way of estimating n_lvls breaks down at about 4 Gbases for the default CSI min_shift. This adds a very simple parser to grab any reference lengths from the headers and find the longest. The value is then used to adjust n_lvls if necessary.
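A "very simple parser" of this kind could look roughly like the sketch below, which scans header text for a key such as a tab followed by `LN:` (SAM) or `length=` (VCF contig lines) and keeps the largest number that follows. `longest_ref()` is a hypothetical helper; it is deliberately naive (it matches the key anywhere in the text) and is not the code added by the commit.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return the largest decimal value that follows any occurrence of
 * key in hdr, or 0 if the key never appears (or is empty). */
static int64_t longest_ref(const char *hdr, const char *key)
{
    int64_t max_len = 0;
    size_t klen = strlen(key);
    if (klen == 0)
        return 0;
    for (const char *p = strstr(hdr, key); p != NULL;
         p = strstr(p + klen, key)) {
        long long v = strtoll(p + klen, NULL, 10);
        if (v > max_len)
            max_len = v;
    }
    return max_len;
}
```

The result would then be used to decide whether the default n_lvls is large enough for the longest reference.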
Commit 9ed0641
Make synced_bcf_reader work with 64 bit positions.
MAX_CSI_COOR is now about 12 Tbp, which is the limit for a CSI index with min_shift 14.
Commit 0f0d091
Commit 9c943da
Round trip test, including structural variation with END INFO tag. Indexing an existing file. Index look-up using tabix. Allow test_compare() to avoid differences due to newlines on Windows.
Commit 4f1a3fc
Use hts_pos_t in tweak_overlap_quality() and related functions
While not strictly necessary (the positions in question are relative to that of read 'b') it makes data types consistent and reduces the possibility of accidental overflow. Also adds a check that the position in the sequence is valid before trying to use it for array look-ups.
Commit 995069d
Add bcf_get_info_int64; make bcf_update_info_int64 a static inline
As bcf_get_info_int64() and bcf_update_info_int64() are new interfaces, they can be made static inlines instead of macros so that the compiler can check the data type of the values / dst parameter. The old macros probably have to stay as they are as we don't know how they are being used in third-party code. The documentation around the bcf_get_info_ and bcf_update_info_ macros is made a bit more doxygen-like.
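The type-checking difference between a macro and a static inline wrapper can be sketched as follows. Every name here (`get_values`, `get_values_int32`, `get_values_int64`, `enum val_type`) is hypothetical and only mirrors the shape of the interfaces described in the commit message.

```c
#include <stdint.h>

enum val_type { VAL_INT32, VAL_INT64 };

/* Hypothetical generic getter: copies n values into an untyped
 * destination buffer, narrowing to 32 bits if requested. */
static int get_values(const int64_t *src, int n, void *dst,
                      enum val_type type)
{
    for (int i = 0; i < n; i++) {
        if (type == VAL_INT64)
            ((int64_t *)dst)[i] = src[i];
        else
            ((int32_t *)dst)[i] = (int32_t)src[i];
    }
    return n;
}

/* Macro wrapper: the (void *) cast hides any mismatch in dst's type. */
#define get_values_int32(src, n, dst) \
    get_values((src), (n), (void *)(dst), VAL_INT32)

/* Inline wrapper: passing anything other than int64_t * for dst
 * draws a diagnostic from the compiler. */
static inline int get_values_int64(const int64_t *src, int n,
                                   int64_t *dst)
{
    return get_values(src, n, dst, VAL_INT64);
}
```

Passing a pointer to a different type (for example a `float` buffer) to `get_values_int32` compiles silently, while the same mistake with `get_values_int64` is reported at compile time.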
Commit f84bba1
Commit 983244b