Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for mips/mipsel #99

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 21 additions & 4 deletions cram/cram_encode.c
Original file line number Diff line number Diff line change
Expand Up @@ -1858,6 +1858,12 @@ static char *cram_encode_aux_1_0(cram_fd *fd, bam_seq_t *b, cram_container *c,
return rg;
}

static inline int is_big_endian(){
int x = 0x01;
char *c = (char*)&x;
return (c[0] != 0x01);
}

/*
* Encodes auxiliary data. Largely duplicated from above, but done so to
* keep it simple and avoid a myriad of version ifs.
Expand Down Expand Up @@ -1949,10 +1955,21 @@ static char *cram_encode_aux(cram_fd *fd, bam_seq_t *b, cram_container *c,

case 'B': {
int type = aux[3], blen;
uint32_t count = (uint32_t)((((unsigned char *)aux)[4]<< 0) +
(((unsigned char *)aux)[5]<< 8) +
(((unsigned char *)aux)[6]<<16) +
(((unsigned char *)aux)[7]<<24));
uint32_t count;
if(is_big_endian())
{
count = (uint32_t)((((unsigned char *)aux)[7]<< 0) +
(((unsigned char *)aux)[6]<< 8) +
(((unsigned char *)aux)[5]<<16) +
(((unsigned char *)aux)[4]<<24));
}
else
{
count = (uint32_t)((((unsigned char *)aux)[4]<< 0) +
(((unsigned char *)aux)[5]<< 8) +
(((unsigned char *)aux)[6]<<16) +
(((unsigned char *)aux)[7]<<24));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change appears to change the interpretation of data in the input file depending on the host endianness. Have you tested this -- does this chunk really improve matters on MIPS?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the first note, aux is a pointer to auxiliary data kept in host byte order.
Four bytes starting at aux[4] are a 4-byte integer value representing the number of elements in an auxiliary data array and byte swapping is done to convert data (counter) from big endian to little endian format before being written out.

PS. With these changes htslib builds fine on both mips and mipsel with all of the tests passing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If aux points to data in host endianness, then the way to copy it to count is with count = *(uint32_t *)&aux[4], or less appallingly with memcpy.

The original bit-shift code is the portable way to convert unaligned wire-endian data to a host integer. So if aux points to data in wire endianness, then the original code is correct.

In the light of @jkbonfield's notes elsewhere on this commit, we should take care to figure out just what kind of data aux points to here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Wed, Jun 04, 2014 at 06:57:23AM -0700, John Marshall wrote:

  •               (((unsigned char *)aux)[7]<<24));
    
  •   uint32_t count;
    
  •   if(is_big_endian())
    
  •   {
    
  •       count = (uint32_t)((((unsigned char *)aux)[7]<< 0) +
    
  •           (((unsigned char *)aux)[6]<< 8) +
    
  •           (((unsigned char *)aux)[5]<<16) +
    
  •           (((unsigned char *)aux)[4]<<24));
    
  •   }
    
  •   else
    
  •   {
    
  •       count = (uint32_t)((((unsigned char *)aux)[4]<< 0) +
    
  •           (((unsigned char *)aux)[5]<< 8) +
    
  •           (((unsigned char *)aux)[6]<<16) +
    
  •           (((unsigned char *)aux)[7]<<24));
    
  •   }
    

If aux points to data in host endianness, then the way to copy it
to count is with count = *(uint32_t *)&aux[4], or less appallingly
with memcpy.

count = *(uint32_t *)&aux[4] could do an uninitialised memory access
so it would need to be verbatim loaded.

The main question here is over how best to deal with byte swapping.
For io_lib I byte swap the main structure (ie bam1_core_t) during
loading (and back again just before saving) as there are lots of
things which historically access these directly rather than using
access functions/macros. I also byte swap the cigar array.

I don't however make any attempt to byte swap the auxiliary data as
it's complex and potentially unnecessary. It could perhaps be
said that there are also going to be code which munges this directly,
but I obviously had the luxury of not caring. :-)

Htslib here differs as it calls swap_data which byte swaps the entire
auxiliary data too. (This is where my CRAM code failed as it wasn't
dealing with this correctly.) Arguably htslib's approach is correct
given that there will be code out there that munges aux manually, and
may not be as inefficient as it ininitially sounds[1].

My original code CRAM was carefully constructing the integer byte by
byte specifcally take make it endianness agnostic, but obviously
doesn't work due to the differences in in-memory representations.
Clearly this needs fixing, and I agree the fix is appropriate to put
into cram_decode.c.

It's probably best done with something like:

count = le_int4(ua_int4((uint32_t *)&aux[4]));

with:

le_int4 from os.h or

#ifdef BIG_ENDIAN
static inline le_int4(uint32_t x) {
return (((x & 0x000000ff) << 24) +
((x & 0x0000ff00) << 8) +
((x & 0x00ff0000) >> 8) +
((x & 0xff000000) >> 24));
}
#else
static inline le_int4(int32_t x) { return x; }
#endif

#ifdef ALLOW_UAC
uint32_t ua_int4(uint32_t *x) { return *x; }
#else
uint32_t ua_int4(uint32_t *x) {
unsigned char *c = (unsigned char *)x;
// correct endianness?
return c[0] + (c[1]<<8) + (c[2]<<16) + (c[3]<<24);
#endif

Ie we can have some standard functions for memory fetching in cases
where we may be unaligned, and use these throughout. The systems
that support unaligned access just inline it to the obvious memory
fetch while other systems do the appropriate magic to fetch the data.

Of course le_int4(ua_int4(...)) combined could probably be written
more efficiently so perhaps we want a compound function too. I think
there are also various gcc primitives now to do some of these byte
swapping operations so again it's best done in a macro or static
inline function (ideally) so they can all be rewritten in the optimal
way for the current build host.

James

[1] Long term, once we add support for specifying what fields are
required for decoding (very useful for CRAM as it can significantly
speed up reading), we can also use that same hint to avoid byte
swapping data we won't be reading. This neatly avoids the worries
about inefficiently byte swapping auxiliary data when we're only, say,
doing a samtools flagstat.

The original bit-shift code is the portable way to convert unaligned
wire-endian data to a host integer. So if aux points to data in
wire endianness, then the original code is correct.

Yes, alas it doesn't for htslib which is where the bug lies. My fault
in not checking.

James

James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still doesn't work. The version here seems to pass make check in that it creates crams and decodes them, but they are not interoperable with crams on other systems due to no byte swapping being used.

It starts with the assumption that the auxiliary data is encoded in the file endianness, not cpu/memory endianness. I'll look into how best to do this. It seems it will need on-the-fly byte swapping for aux data in cram_encode and cram_decode. This shouldn't be too hard to implement.

I have been playing with a variant of this pull request in my own branch:

https://github.com/jkbonfield/htslib/tree/SPARC

The main change here is ua_read/write functions in hts.h for performing memory accesses that may access unaligned memory. (Arguably we can avoid these completely for CIGAR strings by nul padding the name by 1, 2, 3 or 4 nuls instead of just 1 to ensure the cigar is word aligned again - I doubt it would break anything.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further fixes in https://github.com/jkbonfield/htslib/tree/SPARC

I have now tested that branch with both make check (htslib only) and also ported sam, bam and cram in both directions between sparc and intel and verified they are compatible. (I didn't test BCF, sorry.)

This seems to be a more complete test that the one provided in this pull request, but I'm not sure what to do with it. Ie do I do my own pull request? If so to whom? Hence I'm passing the buck and will let someone just look over this branch and merge in :-)

James

PS. The SPARC uncovered lots and lots of unaligned memory accesses. So many that I can only assume the MIPS system does indeed permit these without throwing up an error, albeit possibly not in an efficient manner. Hopefully my ua_* functions will compile down to the same code they originally replaced on x86 systems.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, James

I have checked out your 'SPARC' branch of htslib (https://github.com/jkbonfield/htslib/tree/SPARC) and it builds fine on mips/mipsel with all of the tests passing.

However, htslib from https://github.com/samtools/htslib/ is still failing to build from source on mips and mipsel.

I am will look into it and let you know what I found out.

Best Regards
Aleksandar Zlicic

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbonfield , Could you please rebase the https://github.com/jkbonfield/htslib/tree/SPARC branch?
The status right now is: This branch is 8 commits ahead, 96 commits behind samtools:develop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Sat, Oct 04, 2014 at 10:51:14AM -0700, domibel wrote:

@jkbonfield , Could you please update the https://github.com/jkbonfield/htslib/tree/SPARC branch?
The status right now is: This branch is 8 commits ahead, 96 commits behind samtools:develop

I'm wary of rebasing it as it's been in a public repository already
and I haven't been able to verify it actually works in the newer form
(I don't have the SPARC any more).

So I did a merge and conflict fix. It was a total mess as some of
the original pull request has been merged in with the main develop
branch, but not all (and some got rewritten). Also since then there
has been mass tab vs space changing to add more fun into the mix.

My patch is primarily therefore simply the addition of ua_read/write
macros for supporting unaligned memory access. It's probably better
off to just diff against develop and recode the changes so it's safe
from N-way merging cruft and clean, however I don't have any local
hardware to test on.

James

James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

James, Thank you so much for the merge.

I can confirm that the tests are now passing on sparc. The only issue I see is a compiler warning:

cram/cram_encode.c: In function 'process_one_read':
cram/cram_encode.c:2288:17: warning: pointer targets in passing argument 1 of 'ua_write4s' differ in signedness [-Wpointer-sign]
      ua_write4s(&cig_to[i], ua_read4s(&cig_from[i]));
                 ^
In file included from ./htslib/sam.h:30:0,
                 from ./cram/cram_samtools.h:61,
                 from ./cram/cram.h:47,
                 from cram/cram_encode.c:46:
./htslib/hts.h:211:20: note: expected 'uint32_t *' but argument is of type 'int32_t *'
 static inline void ua_write4s(uint32_t *ip, int32_t v) {
                    ^

I compiled samtools(develop) on top of your branch. Here are the first 2 failing tests:

test_index:
    /home/domibel/samtools/samtools view /tmp/W1Ed9QWpit/large_chrom.bam ref2:1-541556283

    The outputs stdout differ:
        /home/domibel/samtools/test/dat/large_chrom.out
        /home/domibel/samtools/test/dat/large_chrom.out.new
.. failed ...

Comment: The outputfile is empty.
test_mpileup:
    /home/domibel/samtools/samtools mpileup -b /tmp/W1Ed9QWpit/mpileup.bam.list -f /tmp/W1Ed9QWpit/mpileup.ref.fa.gz -r17:100-150
/bin/bash: line 1: 21894 Bus error               ( /home/domibel/samtools/samtools mpileup -b /tmp/W1Ed9QWpit/mpileup.bam.list -f /tmp/W1Ed9QWpit/mpileup.ref.fa.gz -r17:100-150 ) 2> /tmp/leCKJhQAFB

.. failed ...

I looked with gdb at the issue:

Starting program: /home/domibel/samtools/samtools mpileup -b /tmp/7Ib0z6VZj2/mpileup.bam.list -f /tmp/7Ib0z6VZj2/mpileup.ref.fa.gz -r17:100-150
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/sparc-linux-gnu/libthread_db.so.1".
[mpileup] 3 samples in 3 input files
<mpileup> Set max per-file depth to 2666

Program received signal SIGBUS, Bus error.
resolve_cigar2 (s=<optimized out>, pos=17, p=0xd7768) at sam.c:1298
1298              if (_cop(cigar[0]) == BAM_CMATCH || _cop(cigar[0]) == BAM_CEQUAL || _cop(cigar[0]) == BAM_CDIFF) s->k = 0, s->x = c->pos, s->y = 0;
(gdb) bt 
#0  resolve_cigar2 (s=<optimized out>, pos=17, p=0xd7768) at sam.c:1298
#1  bam_plp_next (iter=0xd5230, _tid=0xffffcf30, _pos=0xffffcf34, _n_plp=0xd24f8) at sam.c:1643
#2  0x00072898 in bam_plp_auto (iter=0xd5230, _tid=0xffffcf30, _pos=0xffffcf34, _n_plp=0xd24f8) at sam.c:1722
#3  0x00072da0 in bam_mplp_auto (iter=iter@entry=0xd5368, _tid=_tid@entry=0xffffcff8, _pos=_pos@entry=0xffffcffc, n_plp=n_plp@entry=0xd0188, plp=plp@entry=0xd0178) at sam.c:1811
#4  0x000138f8 in mpileup (conf=conf@entry=0xffffd1c0, n=3, fn=0xd00e0) at bam_plcmd.c:441
#5  0x00016410 in bam_mpileup (argc=6, argv=<optimized out>) at bam_plcmd.c:880
#6  0xf7cd4a30 in __libc_start_main () from /lib/sparc-linux-gnu/libc.so.6
#7  0x00012b38 in _start ()

So this looks like an samtools issue to me. So would it be possible to merge your changes into the official htslib tree?

-Dominique

// skip TN field
aux+=3;

Expand Down
11 changes: 9 additions & 2 deletions cram/os.h
Original file line number Diff line number Diff line change
Expand Up @@ -88,8 +88,15 @@ extern "C" {
* processor type too.
*/

/* Set by autoconf */
#define SP_LITTLE_ENDIAN
#if !defined(SP_BIG_ENDIAN) && !defined(SP_LITTLE_ENDIAN)

# if __BYTE_ORDER == __BIG_ENDIAN
# define SP_BIG_ENDIAN
#elif __BYTE_ORDER == __LITTLE_ENDIAN
# define SP_LITTLE_ENDIAN
#endif

#endif

/* Mac FAT binaries or unknown. Auto detect based on CPU type */
#if !defined(SP_BIG_ENDIAN) && !defined(SP_LITTLE_ENDIAN)
Expand Down
32 changes: 32 additions & 0 deletions htslib/hts.h
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@
#include <stddef.h>
#include <stdint.h>

#include "cram/os.h"

#ifndef HTS_BGZF_TYPEDEF
typedef struct BGZF BGZF;
#define HTS_BGZF_TYPEDEF
Expand Down Expand Up @@ -280,7 +282,17 @@ static inline uint16_t ed_swap_2(uint16_t v)
}
static inline void *ed_swap_2p(void *x)
{
#ifdef ALLOW_UAC
*(uint16_t*)x = ed_swap_2(*(uint16_t*)x);
#else
uint8_t tmpData[2];
uint16_t *ptmpData = (uint16_t*)&tmpData;
uint8_t *px = (uint8_t*)x;
int j;
for(j=0;j<2;j++) tmpData[j] = px[j];
*ptmpData = ed_swap_2(*ptmpData);
for(j=0;j<2;j++) px[j] = tmpData[j];
#endif
return x;
}
static inline uint32_t ed_swap_4(uint32_t v)
Expand All @@ -290,7 +302,17 @@ static inline uint32_t ed_swap_4(uint32_t v)
}
static inline void *ed_swap_4p(void *x)
{
#ifdef ALLOW_UAC
*(uint32_t*)x = ed_swap_4(*(uint32_t*)x);
#else
uint8_t tmpData[4];
uint32_t *ptmpData = (uint32_t*)&tmpData;
uint8_t *px = (uint8_t*)x;
int j;
for(j=0;j<4;j++) tmpData[j] = px[j];
*ptmpData = ed_swap_4(*ptmpData);
for(j=0;j<4;j++) px[j] = tmpData[j];
#endif
return x;
}
static inline uint64_t ed_swap_8(uint64_t v)
Expand All @@ -301,7 +323,17 @@ static inline uint64_t ed_swap_8(uint64_t v)
}
static inline void *ed_swap_8p(void *x)
{
#ifdef ALLOW_UAC
*(uint64_t*)x = ed_swap_8(*(uint64_t*)x);
#else
uint8_t tmpData[8];
uint64_t *ptmpData = (uint64_t*)&tmpData;
uint8_t *px = (uint8_t*)x;
int j;
for(j=0;j<8;j++) tmpData[j] = px[j];
*ptmpData = ed_swap_8(*ptmpData);
for(j=0;j<8;j++) px[j] = tmpData[j];
#endif
return x;
}

Expand Down
136 changes: 130 additions & 6 deletions sam.c
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
#include "htslib/sam.h"
#include "htslib/bgzf.h"
#include "cram/cram.h"
#include "cram/os.h"
#include "htslib/hfile.h"

#include "htslib/khash.h"
Expand Down Expand Up @@ -259,9 +260,14 @@ static inline int aux_type2size(uint8_t type)
}
}

static void swap_data(const bam1_core_t *c, int l_data, uint8_t *data)
typedef enum swap_data_rw {
SWAP_DATA_READ,
SWAP_DATA_WRITE
}swap_data_rw_t;

static void swap_data(const bam1_core_t *c, int l_data, uint8_t *data, swap_data_rw_t rw_mode)
{
uint8_t *s;
uint8_t *s, *s_tmp;
uint32_t *cigar = (uint32_t*)(data + c->l_qname);
uint32_t i, n;
s = data + c->n_cigar*4 + c->l_qname + c->l_qseq + (c->l_qseq + 1)/2;
Expand All @@ -282,13 +288,22 @@ static void swap_data(const bam1_core_t *c, int l_data, uint8_t *data)
break;
case 'B':
size = aux_type2size(*s); ++s;
ed_swap_4p(s); memcpy(&n, s, 4); s += 4;
if(SWAP_DATA_READ == rw_mode)
{
ed_swap_4p(s);
}
s_tmp = s;
memcpy(&n, s, 4); s += 4;
switch (size) {
case 1: s += n; break;
case 2: for (i = 0; i < n; ++i, s += 2) ed_swap_2p(s); break;
case 4: for (i = 0; i < n; ++i, s += 4) ed_swap_4p(s); break;
case 8: for (i = 0; i < n; ++i, s += 8) ed_swap_8p(s); break;
}
if(SWAP_DATA_WRITE == rw_mode)
{
ed_swap_4p(s_tmp);
}
break;
}
}
Expand Down Expand Up @@ -326,7 +341,7 @@ int bam_read1(BGZF *fp, bam1_t *b)
}
if (bgzf_read(fp, b->data, b->l_data) != b->l_data) return -4;
//b->l_aux = b->l_data - c->n_cigar * 4 - c->l_qname - c->l_qseq - (c->l_qseq+1)/2;
if (fp->is_be) swap_data(c, b->l_data, b->data);
if (fp->is_be) swap_data(c, b->l_data, b->data, SWAP_DATA_READ);
return 4 + block_len;
}

Expand All @@ -348,13 +363,13 @@ int bam_write1(BGZF *fp, const bam1_t *b)
for (i = 0; i < 8; ++i) ed_swap_4p(x + i);
y = block_len;
if (ok) ok = (bgzf_write(fp, ed_swap_4p(&y), 4) >= 0);
swap_data(c, b->l_data, b->data);
swap_data(c, b->l_data, b->data, SWAP_DATA_WRITE);
} else {
if (ok) ok = (bgzf_write(fp, &block_len, 4) >= 0);
}
if (ok) ok = (bgzf_write(fp, x, 32) >= 0);
if (ok) ok = (bgzf_write(fp, b->data, b->l_data) >= 0);
if (fp->is_be) swap_data(c, b->l_data, b->data);
if (fp->is_be) swap_data(c, b->l_data, b->data, SWAP_DATA_WRITE);
return ok? 4 + block_len : -1;
}

Expand Down Expand Up @@ -909,6 +924,10 @@ int sam_format1(const bam_hdr_t *h, const bam1_t *b, kstring_t *str)
s = bam_get_aux(b); // aux
while (s+4 <= b->data + b->l_data) {
uint8_t type, key[2];
#ifndef ALLOW_UAC
uint8_t tmpData[8];
int j;
#endif
key[0] = s[0]; key[1] = s[1];
s += 2; type = *s++;
kputc('\t', str); kputsn((char*)key, 2, str); kputc(':', str);
Expand All @@ -924,6 +943,7 @@ int sam_format1(const bam_hdr_t *h, const bam1_t *b, kstring_t *str)
kputsn("i:", 2, str);
kputw(*(int8_t*)s, str);
++s;
#ifdef ALLOW_UAC
} else if (type == 'S') {
if (s+2 <= b->data + b->l_data) {
kputsn("i:", 2, str);
Expand Down Expand Up @@ -959,6 +979,54 @@ int sam_format1(const bam_hdr_t *h, const bam1_t *b, kstring_t *str)
ksprintf(str, "d:%g", *(double*)s);
s += 8;
} else return -1;
#else
} else if (type == 'S') {
if (s+2 <= b->data + b->l_data) {
uint16_t *ptmpData = (uint16_t*)tmpData;
for(j=0;j<2;j++) tmpData[j]=s[j];
kputsn("i:", 2, str);
kputw(*ptmpData, str);
s += 2;
} else return -1;
} else if (type == 's') {
if (s+2 <= b->data + b->l_data) {
int16_t *ptmpData = (int16_t*)tmpData;
for(j=0;j<2;j++) tmpData[j]=s[j];
kputsn("i:", 2, str);
kputw(*ptmpData, str);
s += 2;
} else return -1;
} else if (type == 'I') {
if (s+4 <= b->data + b->l_data) {
uint32_t *ptmpData = (uint32_t*)tmpData;
for(j=0;j<4;j++) tmpData[j]=s[j];
kputsn("i:", 2, str);
kputuw(*ptmpData, str);
s += 4;
} else return -1;
} else if (type == 'i') {
if (s+4 <= b->data + b->l_data) {
int32_t *ptmpData = (int32_t*)tmpData;
for(j=0;j<4;j++) tmpData[j]=s[j];
kputsn("i:", 2, str);
kputw(*ptmpData, str);
s += 4;
} else return -1;
} else if (type == 'f') {
if (s+4 <= b->data + b->l_data) {
float *ptmpData = (float*)tmpData;
for(j=0;j<4;j++) tmpData[j]=s[j];
ksprintf(str, "f:%g", *ptmpData);
s += 4;
} else return -1;
} else if (type == 'd') {
if (s+8 <= b->data + b->l_data) {
double *ptmpData = (double*)tmpData;
for(j=0;j<8;j++) tmpData[j]=s[j];
ksprintf(str, "d:%g", *ptmpData);
s += 8;
} else return -1;
#endif
} else if (type == 'Z' || type == 'H') {
kputc(type, str); kputc(':', str);
while (s < b->data + b->l_data && *s) kputc(*s++, str);
Expand All @@ -977,11 +1045,49 @@ int sam_format1(const bam_hdr_t *h, const bam1_t *b, kstring_t *str)
kputc(',', str);
if ('c' == sub_type) { kputw(*(int8_t*)s, str); ++s; }
else if ('C' == sub_type) { kputw(*(uint8_t*)s, str); ++s; }
#ifdef ALLOW_UAC
else if ('s' == sub_type) { kputw(*(int16_t*)s, str); s += 2; }
else if ('S' == sub_type) { kputw(*(uint16_t*)s, str); s += 2; }
else if ('i' == sub_type) { kputw(*(int32_t*)s, str); s += 4; }
else if ('I' == sub_type) { kputuw(*(uint32_t*)s, str); s += 4; }
else if ('f' == sub_type) { ksprintf(str, "%g", *(float*)s); s += 4; }
#else
else if ('s' == sub_type)
{
int16_t *ptmpData = (int16_t*)tmpData;
for(j=0;j<2;j++)tmpData[j]=s[j];
kputw(*ptmpData, str);
s += 2;
}
else if ('S' == sub_type)
{
uint16_t *ptmpData = (uint16_t*)tmpData;
for(j=0;j<2;j++)tmpData[j]=s[j];
kputw(*ptmpData, str);
s += 2;
}
else if ('i' == sub_type)
{
int32_t *ptmpData = (int32_t*)tmpData;
for(j=0;j<4;j++)tmpData[j]=s[j];
kputw(*ptmpData, str);
s += 4;
}
else if ('I' == sub_type)
{
uint32_t *ptmpData = (uint32_t*)tmpData;
for(j=0;j<4;j++)tmpData[j]=s[j];
kputuw(*ptmpData, str);
s += 4;
}
else if ('f' == sub_type)
{
float *ptmpData = (float*)tmpData;
for(j=0;j<4;j++)tmpData[j]=s[j];
ksprintf(str, "%g", *ptmpData);
s += 4;
}
#endif
}
}
}
Expand Down Expand Up @@ -1071,20 +1177,38 @@ int32_t bam_aux2i(const uint8_t *s)
{
int type;
type = *s++;
#ifdef ALLOW_UAC
if (type == 'c') return (int32_t)*(int8_t*)s;
else if (type == 'C') return (int32_t)*(uint8_t*)s;
else if (type == 's') return (int32_t)*(int16_t*)s;
else if (type == 'S') return (int32_t)*(uint16_t*)s;
else if (type == 'i' || type == 'I') return *(int32_t*)s;
#else
uint8_t tmpData[4];
int j;
if (type == 'c') return (int32_t)*(int8_t*)s;
else if (type == 'C') return (int32_t)*(uint8_t*)s;
else if (type == 's'){ int16_t *ptmpData = (int16_t*)tmpData; for(j=0;j<2;j++)tmpData[j]=s[j]; return (int32_t)(*ptmpData);}
else if (type == 'S'){ uint16_t *ptmpData = (uint16_t*)tmpData; for(j=0;j<2;j++)tmpData[j]=s[j]; return (int32_t)(*ptmpData);}
else if (type == 'i'){ int32_t *ptmpData = (int32_t*)tmpData; for(j=0;j<4;j++)tmpData[j]=s[j]; return *ptmpData;}
else if (type == 'I'){ uint32_t *ptmpData = (uint32_t*)tmpData; for(j=0;j<4;j++)tmpData[j]=s[j]; return *ptmpData;}
#endif
else return 0;
}

double bam_aux2f(const uint8_t *s)
{
int type;
type = *s++;
#ifdef ALLOW_UAC
if (type == 'd') return *(double*)s;
else if (type == 'f') return *(float*)s;
#else
uint8_t tmpData[8];
int j;
if (type == 'd'){ double *ptmpData = (double*)tmpData; for(j=0;j<sizeof(double);j++)tmpData[j]=s[j]; return *ptmpData;}
else if (type == 'f'){ float *ptmpData = (float*)tmpData; for(j=0;j<sizeof(float);j++)tmpData[j]=s[j]; return *ptmpData;}
#endif
else return 0.0;
}

Expand Down
5 changes: 5 additions & 0 deletions test/sam.c
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ DEALINGS IN THE SOFTWARE. */

#include "htslib/sam.h"
#include "htslib/kstring.h"
#include "cram/os.h"

int status;

Expand Down Expand Up @@ -102,7 +103,11 @@ static int aux_fields1(void)
fail("XH field is \"%s\", expected \"%s\"", bam_aux2Z(p), BEEF);

// TODO Invent and use bam_aux2B()
#ifdef SP_LITTLE_ENDIAN
if ((p = check_bam_aux_get(aln, "XB", 'B')) && memcmp(p, "Bc\3\0\0\0\xfe\x00\x02", 9) != 0)
#elif defined SP_BIG_ENDIAN
if ((p = check_bam_aux_get(aln, "XB", 'B')) && memcmp(p, "Bc\0\0\0\3\xfe\x00\x02", 9) != 0)
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would mean that the data in the file became dependent on host endianness, which is clearly wrong -- for interchange, the file contents are the same regardless of host endianness.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that it is not data from file that is being compared here, but data from an internal buffer previously stored in host byte order.

To expand on this, function check_bam_aux_get returns a pointer to auxiliary data which is stored in aln->data (aln is a variable of type bam1_t). This structure has previously been filled with data from input file in function sam_parse1 during reading of input file.

This particular 4 byte integer value is copied to the internal buffer in sam.c:830

830       kputc_('B', &str); kputc_(type, &str); kputsn_(&n, 4, &str);

kputsn_ is defined as (htslib/kstring.h:178):

static inline int kputsn_(const void *p, int l, kstring_t *s)
{
        if (s->l + l > s->m) {
                char *tmp;
                s->m = s->l + l;
                kroundup32(s->m);
                if ((tmp = (char*)realloc(s->s, s->m)))
                        s->s = tmp;
                else
                        return EOF;
        }
        memcpy(s->s + s->l, p, l);
        s->l += l;
        return l;
}

Four bytes of integer variable n are copied to (str->s+str->l) which is the address inside the internal buffer. The value is stored in host byte order, which is big endian in case of big endian mips.

Finally, at the end of sam_parse1, b->data is assigned to be str->s.

842         b->data = (uint8_t*)str.s; b->l_data = str.l; b->m_data = str.m;

Also, I came upon a comment in function sam_parse1 placed before the parsing of aux data (sam.c:778) that states:

// Note that (like the bam1_core_t fields) this aux data in b->data is
 // stored in host endianness; so there is no byte swapping needed here.

As far as I can see, this practice is also followed for other formats, e.g. when data is read from input BAM files and/or written to output BAM files, function swap_data (sam.c:268) is used for converting multibyte values from little endian to big endian format and vice versa, depending on the operation. Again, aux data is kept in host byte order.

Taking this into account, the change from "Bc\3\0\0\0\xfe\x00\x02" to "Bc\0\0\0\3\xfe\x00\x02" reflects the difference in data endianness (depending on the host), not a change in the input file itself.

As I am not an expert on the subject, I could be wrong, so any suggestions on how to correctly support htslib for mips/mipsel are most welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that perhaps it is the internal memory which is in host order rather than the file.

For CRAM I finally managed to build (yawn) Staden io_lib-1.13.7 on an emulated big endian machine. It worked out of the box and produces both BAM and CRAM which I can decode on an Intel linux box. I tried the opposite direction and that also works. So the on-disk formats are working fine there.

Obviously io_lib isn't the same as htslib (in particular the BAM component is totally different), but the CRAM source has a common ancestry and will be synced again shortly after the 1.0 release (once CRAM v3.0 becomes finalised). So I'm wary of changes to cram_encode.c given they apparently aren't necessary within io_lib.

On little endian machines the in-memory data structure and on-disk data structure is identical. There are clearly two approaches here. 1) Keep the in-memory data structure the same as the on-disk structure and byte swap just-in-time when accessing. 2) Byte swap the in-memory data structures on reading so manipulation is easy, and then byte swap back again just before writing. Given the need to prevent unaligned accesses too, option 1 is often easier, although it can be cheated a bit (in my own implementation I deliberately pad out the read name with nuls to make the CIGAR field word aligned again, as a faster alternative to byte-by-byte loading).

I suspect it is differences in choices of method 1 vs 2 between io_lib and htslib that causes the CRAM code to not work properly on big endian machines within htslib, but I haven't yet tested that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, you are correct re this code: the data is in host-endianness but still packed for wire-alignment... despite having added that comment, I had forgotten just how baroque this code is! So test/sam.c does indeed need fixing, though the right fix is to check p[2..5] as a host int (modulo alignment concerns) rather than as a string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still unaligned memory accesses in sam.c. Eg line 729:

cigar[i] = strtol(q, &q, 10)<<BAM_CIGAR_SHIFT;

I'll experiment some more.

fail("XB field is %c,..., expected c,-2,0,+2", p[1]);

if ((p = check_bam_aux_get(aln, "ZZ", 'I')) && bam_aux2i(p) != 1000000)
Expand Down
Loading