-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for mips/mipsel #99
base: develop
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,6 +30,7 @@ DEALINGS IN THE SOFTWARE. */ | |
|
||
#include "htslib/sam.h" | ||
#include "htslib/kstring.h" | ||
#include "cram/os.h" | ||
|
||
int status; | ||
|
||
|
@@ -102,7 +103,11 @@ static int aux_fields1(void) | |
fail("XH field is \"%s\", expected \"%s\"", bam_aux2Z(p), BEEF); | ||
|
||
// TODO Invent and use bam_aux2B() | ||
#ifdef SP_LITTLE_ENDIAN | ||
if ((p = check_bam_aux_get(aln, "XB", 'B')) && memcmp(p, "Bc\3\0\0\0\xfe\x00\x02", 9) != 0) | ||
#elif defined SP_BIG_ENDIAN | ||
if ((p = check_bam_aux_get(aln, "XB", 'B')) && memcmp(p, "Bc\0\0\0\3\xfe\x00\x02", 9) != 0) | ||
#endif | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change would mean that the data in the file became dependent on host endianness, which is clearly wrong -- for interchange, the file contents are the same regardless of host endianness. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like that it is not data from file that is being compared here, but data from an internal buffer previously stored in host byte order. To expand on this, function This particular 4 byte integer value is copied to the internal buffer in sam.c:830
Four bytes of integer variable Finally, at the end of
Also, I came upon a comment in function
As far as I can see, this practice is also followed for other formats, e.g. when data is read from input BAM files and/or written to output BAM files, function Taking this into account, the change from As I am not an expert on the subject, I could be wrong, so any suggestions on how to correctly support htslib for mips/mipsel are most welcome. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed that perhaps it is the internal memory which is in host order rather than the file. For CRAM I finally managed to build (yawn) Staden io_lib-1.13.7 on an emulated big endian machine. It worked out of the box and produces both BAM and CRAM which I can decode on an Intel linux box. I tried the opposite direction and that also works. So the on-disk formats are working fine there. Obviously io_lib isn't the same as htslib (in particular the BAM component is totally different), but the CRAM source has a common ancestry and will be synced again shortly after the 1.0 release (once CRAM v3.0 becomes finalised). So I'm wary of changes to cram_encode.c given they apparently aren't necessary within io_lib. On little endian machines the in-memory data structure and on-disk data structure is identical. There are clearly two approaches here. 1) Keep the in-memory data structure the same as the on-disk structure and byte swap just-in-time when accessing. 2) Byte swap the in-memory data structures on reading so manipulation is easy, and then byte swap back again just before writing. Given the need to prevent unaligned accesses too, option 1 is often easier, although it can be cheated a bit (in my own implementation I deliberately pad out the read name with nuls to make the CIGAR field word aligned again, as a faster alternative to byte-by-byte loading). I suspect it is differences in choices of method 1 vs 2 between io_lib and htslib that causes the CRAM code to not work properly on big endian machines within htslib, but I haven't yet tested that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Apologies, you are correct re this code: the data is in host-endianness but still packed for wire-alignment... despite having added that comment, I had forgotten just how baroque this code is! So test/sam.c does indeed need fixing, though the right fix is to check There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are still unaligned memory accesses in sam.c. Eg line 729:
I'll experiment some more. |
||
fail("XB field is %c,..., expected c,-2,0,+2", p[1]); | ||
|
||
if ((p = check_bam_aux_get(aln, "ZZ", 'I')) && bam_aux2i(p) != 1000000) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change appears to change the interpretation of data in the input file depending on the host endianness. Have you tested this -- does this chunk really improve matters on MIPS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to the first note, aux is a pointer to auxiliary data kept in host byte order.
Four bytes starting at
aux[4]
are a 4-byte integer value representing the number of elements in an auxiliary data array and byte swapping is done to convert data (counter) from big endian to little endian format before being written out.PS. With these changes htslib builds fine on both mips and mipsel with all of the tests passing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If
aux
points to data in host endianness, then the way to copy it tocount
is withcount = *(uint32_t *)&aux[4]
, or less appallingly withmemcpy
.The original bit-shift code is the portable way to convert unaligned wire-endian data to a host integer. So if
aux
points to data in wire endianness, then the original code is correct.In the light of @jkbonfield's notes elsewhere on this commit, we should take care to figure out just what kind of data
aux
points to here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Wed, Jun 04, 2014 at 06:57:23AM -0700, John Marshall wrote:
count = *(uint32_t *)&aux[4]
could do an uninitialised memory accessso it would need to be verbatim loaded.
The main question here is over how best to deal with byte swapping.
For io_lib I byte swap the main structure (ie bam1_core_t) during
loading (and back again just before saving) as there are lots of
things which historically access these directly rather than using
access functions/macros. I also byte swap the cigar array.
I don't however make any attempt to byte swap the auxiliary data as
it's complex and potentially unnecessary. It could perhaps be
said that there are also going to be code which munges this directly,
but I obviously had the luxury of not caring. :-)
Htslib here differs as it calls swap_data which byte swaps the entire
auxiliary data too. (This is where my CRAM code failed as it wasn't
dealing with this correctly.) Arguably htslib's approach is correct
given that there will be code out there that munges aux manually, and
may not be as inefficient as it ininitially sounds[1].
My original code CRAM was carefully constructing the integer byte by
byte specifcally take make it endianness agnostic, but obviously
doesn't work due to the differences in in-memory representations.
Clearly this needs fixing, and I agree the fix is appropriate to put
into cram_decode.c.
It's probably best done with something like:
count = le_int4(ua_int4((uint32_t *)&aux[4]));
with:
le_int4 from os.h or
#ifdef BIG_ENDIAN
static inline le_int4(uint32_t x) {
return (((x & 0x000000ff) << 24) +
((x & 0x0000ff00) << 8) +
((x & 0x00ff0000) >> 8) +
((x & 0xff000000) >> 24));
}
#else
static inline le_int4(int32_t x) { return x; }
#endif
#ifdef ALLOW_UAC
uint32_t ua_int4(uint32_t *x) { return *x; }
#else
uint32_t ua_int4(uint32_t *x) {
unsigned char *c = (unsigned char *)x;
// correct endianness?
return c[0] + (c[1]<<8) + (c[2]<<16) + (c[3]<<24);
#endif
Ie we can have some standard functions for memory fetching in cases
where we may be unaligned, and use these throughout. The systems
that support unaligned access just inline it to the obvious memory
fetch while other systems do the appropriate magic to fetch the data.
Of course le_int4(ua_int4(...)) combined could probably be written
more efficiently so perhaps we want a compound function too. I think
there are also various gcc primitives now to do some of these byte
swapping operations so again it's best done in a macro or static
inline function (ideally) so they can all be rewritten in the optimal
way for the current build host.
James
[1] Long term, once we add support for specifying what fields are
required for decoding (very useful for CRAM as it can significantly
speed up reading), we can also use that same hint to avoid byte
swapping data we won't be reading. This neatly avoids the worries
about inefficiently byte swapping auxiliary data when we're only, say,
doing a samtools flagstat.
Yes, alas it doesn't for htslib which is where the bug lies. My fault
in not checking.
James
James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still doesn't work. The version here seems to pass make check in that it creates crams and decodes them, but they are not interoperable with crams on other systems due to no byte swapping being used.
It starts with the assumption that the auxiliary data is encoded in the file endianness, not cpu/memory endianness. I'll look into how best to do this. It seems it will need on-the-fly byte swapping for aux data in cram_encode and cram_decode. This shouldn't be too hard to implement.
I have been playing with a variant of this pull request in my own branch:
https://github.com/jkbonfield/htslib/tree/SPARC
The main change here is ua_read/write functions in hts.h for performing memory accesses that may access unaligned memory. (Arguably we can avoid these completely for CIGAR strings by nul padding the name by 1, 2, 3 or 4 nuls instead of just 1 to ensure the cigar is word aligned again - I doubt it would break anything.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further fixes in https://github.com/jkbonfield/htslib/tree/SPARC
I have now tested that branch with both make check (htslib only) and also ported sam, bam and cram in both directions between sparc and intel and verified they are compatible. (I didn't test BCF, sorry.)
This seems to be a more complete test that the one provided in this pull request, but I'm not sure what to do with it. Ie do I do my own pull request? If so to whom? Hence I'm passing the buck and will let someone just look over this branch and merge in :-)
James
PS. The SPARC uncovered lots and lots of unaligned memory accesses. So many that I can only assume the MIPS system does indeed permit these without throwing up an error, albeit possibly not in an efficient manner. Hopefully my ua_* functions will compile down to the same code they originally replaced on x86 systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, James
I have checked out your 'SPARC' branch of htslib (https://github.com/jkbonfield/htslib/tree/SPARC) and it builds fine on mips/mipsel with all of the tests passing.
However, htslib from https://github.com/samtools/htslib/ is still failing to build from source on mips and mipsel.
I am will look into it and let you know what I found out.
Best Regards
Aleksandar Zlicic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbonfield , Could you please rebase the https://github.com/jkbonfield/htslib/tree/SPARC branch?
The status right now is: This branch is 8 commits ahead, 96 commits behind samtools:develop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Sat, Oct 04, 2014 at 10:51:14AM -0700, domibel wrote:
I'm wary of rebasing it as it's been in a public repository already
and I haven't been able to verify it actually works in the newer form
(I don't have the SPARC any more).
So I did a merge and conflict fix. It was a total mess as some of
the original pull request has been merged in with the main develop
branch, but not all (and some got rewritten). Also since then there
has been mass tab vs space changing to add more fun into the mix.
My patch is primarily therefore simply the addition of ua_read/write
macros for supporting unaligned memory access. It's probably better
off to just diff against develop and recode the changes so it's safe
from N-way merging cruft and clean, however I don't have any local
hardware to test on.
James
James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
James, Thank you so much for the merge.
I can confirm that the tests are now passing on sparc. The only issue I see is a compiler warning:
I compiled samtools(develop) on top of your branch. Here are the first 2 failing tests:
So this looks like an samtools issue to me. So would it be possible to merge your changes into the official htslib tree?
-Dominique