Extreme memory consumption when reading certain SRA records? #31
Comments
The big difference between these two runs is ERR191522 contains alignments and DRR001375 contains only unaligned reads. In the unaligned case, the reads are stored together in one record. The access pattern in your code snippet will be in the same order as the data is laid out in the file. In the aligned case, the reads are stored in the order they align on the reference. The two mate pairs are not stored together and might be far apart in the file. Your code snippet will reconstruct the whole read which requires random access I/O to the file. Once upon a time, this wasn't a big problem, but these days big disks are structured as bucket stores with random I/O being very expensive. To try to ameliorate this, we have tuned our underlying database technology to cache aggressively. If your usage does NOT require both mate pairs, it could be sped up.
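To make the access-pattern point concrete, here is a toy model (not SRA code). It assumes each fragment occupies one numbered slot in the file and adds up how far a reader travels when it visits slots in a given order. The name `seek_distance` and the slot layouts are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum of the distances between consecutively visited storage slots.
// A purely sequential scan of n slots costs n - 1; anything larger
// means the reader had to jump around.
std::uint64_t seek_distance(const std::vector<std::size_t>& slots) {
    std::uint64_t total = 0;
    for (std::size_t i = 1; i < slots.size(); ++i) {
        const std::size_t a = slots[i - 1];
        const std::size_t b = slots[i];
        total += (a > b) ? (a - b) : (b - a);
    }
    return total;
}
```

With mates stored side by side, three reads are visited as slots 0, 1, 2, 3, 4, 5 and the reader only moves forward. If the same fragments are stored in alignment order so that mates land in, say, slots (0, 3), (4, 1) and (2, 5), visiting them read by read more than doubles the distance in this toy case, and on a real run the jumps can span much of the file.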
We are creating an internal ticket to look into this behavior. Thanks for the report and the graphs!
For the particular problem we're trying to solve, we do not need to access mate pairs together (or in any particular order). Is there a code example that illustrates how to access read data without the need for aggressive caching or random access I/O?
When a run that is aligned has been detected, the reads will be available first through their primary alignments. If these are retrieved as individual reads in the absence of quality scores, they will generate the best performance. The remaining reads will be those that did not align, and they can be fetched separately.
You start with a
Once finished, look for unaligned reads via
It's also useful to avoid converting a
Okay, I think that's it - off the top of my head. If I left anything out, @durbrow please correct me!
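A rough sketch of the two-pass pattern described above: first stream the primary alignments one fragment at a time (bases only, no qualities), then stream the reads in the unaligned category. The member names used on the run and its iterators (`getAlignments`, `nextAlignment`, `getReads`, `nextRead`, `nextFragment`, `getFragmentBases`) are what I believe the NGS C++ API exposes, so treat them as assumptions and check them against your SDK version. `FragmentTally`, `tally_fragments` and the `Fake*` stand-ins are hypothetical and exist only so the sketch compiles without the SDK.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Counts gathered while streaming a run in two passes.
struct FragmentTally {
    std::uint64_t aligned_fragments = 0;
    std::uint64_t unaligned_fragments = 0;
    std::uint64_t total_bases = 0;
};

// Run is expected to look like ngs::ReadCollection (assumption). With the
// real SDK the categories would presumably be ngs::Alignment::primaryAlignment
// and ngs::Read::unaligned.
template <class Run, class AlignmentCategory, class ReadCategory>
FragmentTally tally_fragments(const Run& run, AlignmentCategory primary,
                              ReadCategory unaligned) {
    FragmentTally tally;

    // Pass 1: primary alignments, taken as isolated fragments.
    auto alignments = run.getAlignments(primary);
    while (alignments.nextAlignment()) {
        auto bases = alignments.getFragmentBases();
        tally.total_bases += bases.size();
        ++tally.aligned_fragments;
    }

    // Pass 2: reads that did not align, iterated fragment by fragment.
    auto reads = run.getReads(unaligned);
    while (reads.nextRead()) {
        while (reads.nextFragment()) {
            auto bases = reads.getFragmentBases();
            tally.total_bases += bases.size();
            ++tally.unaligned_fragments;
        }
    }
    return tally;
}

// Hypothetical in-memory stand-ins that mimic the iterator shape above.
struct FakeAlignments {
    std::vector<std::string> frags;
    std::size_t next = 0;
    bool nextAlignment() { return next++ < frags.size(); }
    const std::string& getFragmentBases() const { return frags[next - 1]; }
};

struct FakeReads {
    std::vector<std::vector<std::string>> reads;
    std::size_t next_read = 0;
    std::size_t next_frag = 0;
    bool nextRead() { next_frag = 0; return next_read++ < reads.size(); }
    bool nextFragment() { return next_frag++ < reads[next_read - 1].size(); }
    const std::string& getFragmentBases() const {
        return reads[next_read - 1][next_frag - 1];
    }
};

struct FakeRun {
    std::vector<std::string> primary;
    std::vector<std::vector<std::string>> unaligned;
    FakeAlignments getAlignments(int) const {
        FakeAlignments it;
        it.frags = primary;
        return it;
    }
    FakeReads getReads(int) const {
        FakeReads it;
        it.reads = unaligned;
        return it;
    }
};
```

Against the real SDK the call would presumably be `tally_fragments(run, ngs::Alignment::primaryAlignment, ngs::Read::unaligned)`, with `run` obtained from `ncbi::NGS::openReadCollection(...)`; those names are again an assumption on my part. Neither pass asks for both mates of a read at once, which is the point of the pattern.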
When counting the number of sequences in the primary alignment and unaligned reads, I get a slightly different sequence count from the value obtained by iterating over
For example, in SRR10742149:
Total number of sequences = 83490707
However, there are 83491358 sequences in 41745679 reads, using
Yes, I probably should have mentioned that. As you see,
For the cases where one mate is aligned but the other not, these are partially aligned and we don't have a nice, clean category for them. If you ask for all but iterate as fragments, this may be what you need.
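A quick check of the numbers quoted for SRR10742149, assuming 83490707 is the two-pass count (primary alignments plus unaligned reads) and 83491358 is the count from iterating over all reads as fragments. The constant and function names are only for illustration.

```cpp
#include <cstdint>

// Figures quoted in the thread for SRR10742149.
constexpr std::uint64_t kReads = 41745679;
constexpr std::uint64_t kFragmentsAllReads = 83491358;  // all reads, as fragments
constexpr std::uint64_t kFragmentsTwoPass = 83490707;   // assumed: primary + unaligned

// Average fragments per read when iterating over everything.
constexpr std::uint64_t fragments_per_read() {
    return kFragmentsAllReads / kReads;
}

// Fragments that the two-pass count does not see.
constexpr std::uint64_t missed_fragments() {
    return kFragmentsAllReads - kFragmentsTwoPass;
}

static_assert(kReads * 2 == kFragmentsAllReads, "every read has two fragments");
static_assert(missed_fragments() == 651, "two-pass count is short by 651");
```

So every read contributes two fragments, and the two-pass count is short by 651 fragments. That is consistent with the explanation above that the unaligned mates of partially aligned reads fall into neither pass, though the thread does not confirm the breakdown.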
Great to hear! One of the things that is problematic is assembling the mates. If you can take the mates in isolation, you'll have a better day.
While the above strategy of first loading primary alignments and then loading unaligned reads works with many (most?) SRA records (like the above-mentioned ERR191522), there appear to be some SRA records for which this strategy generates an error. For example, all of the 2007231 primary alignments in ERR634825 can be read, but the following C++ exception is thrown when attempting to iterate through the 846553 unaligned reads:
Ah yes, aligned colorspace! They are broken, and unfortunately they can't (*) be fixed. Is it crucial for you to have these reads? Although I don't have a count, I do know there aren't very many of these in the SRA.
* the people who decide such things decided it wasn't worth the expense
These reads are not crucial. However, is it possible to interrogate an
As it stands, I can identify records that can successfully load 100% of the
I am reading sequence data from SRA records by first downloading the SRA record with the prefetch command and then iterating through the file using the C++ interface (version 2.10.8), i.e.:
In general, this approach seems to work well. However, I have noticed that for some SRA records (like ERR191522), there is (a) significant memory consumption and (b) a dramatic slow-down when iterating through the file. The following plot shows the speed (in reads per second) and memory consumption (from /proc/meminfo, reported as a fraction of total system memory):
Other SRA records seem to be fine. For DRR001375, the following graph shows fairly constant speed and memory usage:
Is there a way to read SRA records, like ERR191522, without the large memory consumption? If not, is there a way to identify SRA records (in advance) that will exhibit this behavior (as the available RAM on cluster instances can easily be exhausted while processing a single SRA record)?