-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work around colons in reference names (GRCh38 HLA). #708
Conversation
hts.c
Outdated
// FIXME: do we need to permit tid=-1 for reference "*" to indicate unmapped | ||
// reads, and strictly have NULL as failure test? | ||
int tid_, s_len = strlen(s); // int is sufficient given beg/end types | ||
if (!tid) tid = &tid_; // simplifies code below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check all the pointers at the beginning, especially as the function would now be part of the interface.
Based on comments from @jmarshall in samtools/hts-specs#124 I've added brace quoting as an experiment, just to see how it feels code wise as well as usage wise. Seems workable IMO. If this is accepted as a general way forward (irrespective of the usual tweaks in review) then I'll look at a samtools patch too. Eg bedidx.c calls hts_parse_reg directly and could benefit from being given the bam header. We also need to figure out how this fits with bcftools, but that's a bigger task. Maybe before we finalise this we need to determine whether the new API needs an additional argument about how parsing is done (whether chr:1 is 1-1 or 1-end) so we can share the common code. Even if not done now, making provision for it in the API is key. |
Further points for discussion:
Some of these could just be implicit. I don't think there is an issue parsing with strtod instead of strtol regardless, and spotting K, M, G etc is also upwardly compatible, so maybe they just need to be there always. Similarly chr1:-555 ought to work in Samtools as it's also compatible with chr1:1-555. The key distinction though is the bcftools vs samtools parsing for single value ranges with the different interpretation for region end. Edit: one more for the list
We'd need to supplement this with bam_parse_region, bcf_parse_region etc functions that supply appropriate parameters to ease usage for that header type and the expected set of parameters. |
👍 that's the sort of thing I had in mind back in #260. And something like HTS_PARSE_SINGLE so that bcftools could be brought into the fold was first on the list 😄 I'd suggest adding a new all-singing all-dancing And as you say, adding |
Mutter mutter. Yet another nuance. So, bcftools Also the bcftools version has chr_beg and chr_end so {HLA:1:2:3} can return chr_beg pointing to 'H' and chr_end pointing to '}'. This is useful as the single return value from hts_parse_reg doesn't cut it when we may, or may not, start with '{' or even white space (this is removed in the bcftools version). I'm thinking it has the saner interface here. Finally bcftools mpileup foo.bam takes regions parsed according to bcftools conventions, not according to SAM conventions. Ie it is the tool that defines the parsing rules and not the file format. Hence our What a quagmire. :-( It's asymptotically getting there though. |
Yes. What you're saying is that
All this returning pointers to the (start/)end of the chromosome name within the input string is a hack based not knowing the table of expected contig names and needing to return the chr substring in some C language-compatible way (and not wanting to malloc or copy it). If your super new function is given the table of expected contig names (or, in general, a callback function that returns a This would mean it wouldn't support not being given a table of expected contig names, but in a way that would be a good thing. In all the current htslib/bcftools/samtools sites where |
Yes, because at present I wrap the old API in terms of calling the new API, and the old API isn't given this information. Naturally we can (and should) update all samtools commands to be using the new API (most are already due to iterators), but we cannot assume all third party library users will do this. Either we accept this is optional, or we duplicate the code and have both old and new functions in full. (For API sanity the latter may well be preferable.) I was also considering having a returned structure. Eg:
I'd assumed that actually where chr_beg and chr_end (and the equivalent char* return value from htslib) were used was both for the header checking later on, but also for messages. That said, the latter is easily resolved by the reverse bam_id2name equivalent function (it's not called that, but both sam and vcf headers have equivalents) and you're correct the external (samtools) usage of hts_parse_reg has header available. Even to the extent of almost writing the same function again - see bam_parse_region in samtools/bam_aux.c for example! It's a little complex in bcftools though as the equivalent over there uses regidx_push with strings and then does the index lookup itself in a hash table. This is sort of equivalent to bam_name2id and co, but with the subtlety of being used elsewhere for UTR and Exon names plus the ability to dynamically grow the dictionary and assign new tids on the fly for unseen data. This still fits in with a getid function pointer usage though so it's all workable, just a bit more work. Certainly less complexity in the API is worth a bit more grunt work elsewhere. As far as return values go, I think returning the end of the parsed string is sensible as it permits lists reg,reg,reg... to be parsed in a simple loop. (NULL for failure. *ret == '\0' for list end). It's more flexible than an integer error code and avoids the need for an additional argument |
That wasn't my question. My question was: where the old API is called, do the callers have that information at their fingertips (even though the old API doesn't ask for it)? If so, the callers would be in a position to give it to a new API. You can hack such that the old API can be written in terms of the new API, or you can design something good… 😄 I'd still suggest adding a new all-singing all-dancing hts_parse_region() with a sensible calling convention unrelated to that of hts_parse_reg(), and deprecating hts_parse_reg() and leaving it as it is.
This is an interesting point, and also interesting that a callback function can be given something it hasn't seen before and say "yes that's a valid contig name" and return a newly-assigned That wouldn't work too well with asking the callback whether "foo" or "foo:1-200" is the contig part of "foo:1-200" though. Ideally the callback would have a "confidence" argument to signal the context of the putative contig name… |
I think if it's called with a callback function that just accepts new strings then that's fine, but it can no longer do the same level of sanity checking. We lose nothing though as it already has that issue right now too. See bcftools csq.c and the regidx_push implementation. Anyway I'm heeding your advice and making the new API more draconian. It'll have the various parameters as mandatory now, which simplifies the implementation somewhat too. Yes all the code I've seen calling the old API that I've found so far does have some way in getting the header. I've also taken the liberty of making all parsers cope with chrX:100- to mean 100 to end (and -100 to mean 1-100). This is a change in behaviour as right now with samtools -100 means 0 to end (-100 is beginning and it's clipped to zero), but frankly I view that as a bug! This can of worms is really quite wriggly. :-) |
Totally agree with making sensible things happen when you omit either end of I am coming around to your |
…except what about keeping |
For now at least it's the simpler API. I'm testing whether I can get it working universally with samtools and bcftools now, before amending the htslib side from WIP to a PR for actual consideration. There is one other benefit to using a structure though - the data types returned are under control of the API and not just &local_int. The latter, as we can see elsewhere, breaks if we change it to int64_t or hts_pos_t. Not that it matters mind as once we've done it I doubt we'll be changing them again! Anyway, I'll see after my explorations with bcftools whether an extra argument actually buys us anything. If it does, then a single structure is the way to go rather the multiple chr_beg, chr_end etc args. |
Regarding multiple return values, I see the usage at the moment of hts_parse_reg in samtools is by and large just for error reporting. Eg see samtools view:
This is perhaps over the top. A separate return value could indeed allow us to distinguish these cases, but a joint "Malformed region or unknown reference" error is probably sufficient, especially given the old (and new) code already writes to stderr if it spots dodgy text in hts_parse_decimal. If we really care about being specific, then it's possible we could use something like tid as the error anyway. Ie if NULL is returned, see tid for the response code. (-1 for unknown, as it was before, something else for parse error). Personally I think a single works or doesn't is sufficient. I don't want to spend too long of my life catering to documenting all the details of how it can go wrong! ;-) |
One advantage of an integer return value is that you can add |
It also turns out that the samtools bedidx code could benefit from having the chr start and end pointers available (or start + length). It's not strictly necessary as we could hash on tid instead, but it's substantially more code change (arguably an improvement though). Anyway, comments on a postcard as usual... |
Re a joint "Malformed region or unknown reference" error is probably sufficient: https://www.biostars.org/p/320129/ is a good example of the value of individual specific error messages. Quick, at a glance, what's wrong with the region in the question? It turns out the two error messages in the code you quoted are there because previously there was a single (not general enough, hence misleading) error message; see samtools/samtools#353. Just some data points… 😄 |
I saw 353 and agree it's an incorrect error message, but that wasn't quite my point. Incorrect errors are far worse than general non-specific errors. Hence "Unknown reference or failure to parse region string" at least gives you a clue it may be the latter. Same with the end<start scenario on biostars (I saw it instantly, but I've been looking at the code and know what makes it fail). I think if we want more specific errors though, we should just use hts_log_warning/error in htslib itself as it's far easier than expecting every call to this to report correctly, plus we simply don't have a proper error reporting system for htslib. Ideally we'd have hts_errno and hts_perror style functionality, but that's a totally different PR. :-) However I'm happy to concede to having an int return for err code and another way of returning the next pointer location, eg via a struct return as suggested before. This gives some flexibility. I'm abivalent as to whether we should have tid, beg, end explicit and not part of the struct. On one hand all-in-struct is tidier and also enforces correct use of data types (eg when we migrate to 64-bit integers), but on the other hand there are 3 essentially required arguments that everyone must deal with plus optional things for the rare case. (The rare case here being bcftools; rare because I really can't see myself tackling the complexity of the parsing interactions there with this round.) |
Note this may not actually be working despite travis/appveyor claiming it does. As requested the new function now fills out int64_t, but I haven't upgraded everything else to handle 64-bit pos as that's a different PR already. As a consequence, Either way, ignore any problems this PR makes as it is, now, expected to only work when used in conjunction with #709. |
This extends the callback function used by the hts iterator code to the ref:start-end range parser, permitting the parser to validate whether the whole thing matches a reference name. This works around parsing conflicts when given range queries like "HLA-DRB1*12:17", which is either the whole of "HLA-DRB1*12:17" (this exists in GRCh38) or base 17 onwards of ""HLA-DRB1*12" (which does not). Note there are still some undecided questions here. Do we want to handle the special names used in the iterator at this level to? Eg "*" meaning unmapped data and "." meaning whole file?
Also improved return values so NULL is primary way of detecting failure rather than tid. This is more in line with the old hts_parse_reg code. See samtools/hts-specs#124 (comment) for heuristic suggestions.
Eg with contigs named "chr1" and "chr1:100-200" we can specify "{chr1}:100-200" and "{chr1:100-200}" to disambiguate.
The old function (hts_parse_reg) has been put back, meaning the new API can now have the getid func pointer as a mandatory requirement. Added tests. Tweaked sam_parse_region to have the flags parameter. This is still required as bcftools and samtools use different parameters (parsing style is tool based rather than file format based).
Previously it had its own custom parser which nearly, but not quite, matched the old hts_parse_reg code in functionality. Note this changes the fasta coordinate type from long to int, which is a backwards step. However it is expected this will subsequently change to hts_pos_t in another PR.
Mainly copied from the useful (but not so visible) comment before the function definition in hts.c.
Hash table can now use tid (cast to khash32_t, which is unsigned) as key instead of the region name, avoiding some string copying.
Fixed bug which broke samtools, updated hts_reglist_create() to use the new interface and put a bit more documentation in the header files. |
It looks like the work around mentioned in 0eac47d brings back the SIZE_MAX macro, which some glibc versions don't include for c++. |
Ah, yes, I'd forgotten about the c++ SIZE_MAX problem. Also, looking at the last fix for this (#772) the possible overflow should already be detected. We'll look into it to see if there's a better way of solving this. |
I'm not sure if |
Fixed by 164d507. gcc had managed to find the one value of |
This resolves #705 which is HLA specific with a more general workaround for badly named references.
Note this has some open questions that need resolving, either by deleting the FIXME in the code or actually implementing it. As such this PR is a work in progress for discussion.
Specifically: