
Reduce maximum memory usage while reading SRA data? #97

Open
jgans opened this issue Mar 28, 2023 · 2 comments
jgans commented Mar 28, 2023

Reading SRA data with the C VDB API (following the fasterq-dump utility's strategy for accessing SRA records) can consume a significant amount of RAM per record. This is an issue when attempting to minimize cloud computing resources (i.e., instance RAM) while processing a large number of SRA records.

The maximum amount of RAM used while reading (as measured with /usr/bin/time -v) depends on the record:
[screenshot: maximum RAM usage measured for several SRA records]

Periodically calling VCursorRelease() and VCursorOpen() to force the VDB interface to deallocate RAM offers a modest reduction in the maximum RAM used (about 25%), but this strategy significantly slows down the rate at which an SRA record is read.

Is it possible/feasible to limit the VDB C API's memory consumption to sub-gigabyte levels, independent of the number of reads? The goal is to read through an SRA record once, as quickly as possible and with as little RAM as possible.
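
For reference, here is a minimal sketch (not the exact code behind the measurements above) of the release/reopen cycle mentioned earlier: read the table in batches and tear the cursor down between batches so VDB can drop its internal caches. It assumes the standard read-cursor calls from `<vdb/table.h>` and `<vdb/cursor.h>`; the batch size is an arbitrary tuning knob and error handling is reduced to early returns.

```c
#include <klib/rc.h>
#include <vdb/table.h>
#include <vdb/cursor.h>

/* Sketch: read rows [first, first + count) of an already-open VTable in
 * batches, recreating the cursor between batches so its caches are freed. */
static rc_t read_in_batches(const VTable *tbl, int64_t first, uint64_t count)
{
    const uint64_t BATCH_ROWS = 1000000;     /* hypothetical batch size */
    int64_t row = first;
    const int64_t end = first + (int64_t)count;

    while (row < end) {
        const VCursor *curs = NULL;
        uint32_t read_idx = 0;

        rc_t rc = VTableCreateCursorRead(tbl, &curs);
        if (rc != 0) return rc;
        rc = VCursorAddColumn(curs, &read_idx, "READ");
        if (rc == 0) rc = VCursorOpen(curs);

        for (uint64_t n = 0; rc == 0 && n < BATCH_ROWS && row < end; ++n, ++row) {
            const void *base = NULL;
            uint32_t elem_bits = 0, boff = 0, row_len = 0;
            rc = VCursorCellDataDirect(curs, row, read_idx,
                                       &elem_bits, &base, &boff, &row_len);
            /* ... consume row_len bases starting at base ... */
        }

        /* release the cursor (and whatever it cached) before the next batch */
        VCursorRelease(curs);
        if (rc != 0) return rc;
    }
    return 0;
}
```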

durbrow (Collaborator) commented Mar 28, 2023

Possible? Of course. Feasible? Not so much.

Most likely the memory is being used for caching the reference sequences. If you use an access pattern like fasterq-dump's, you will have more reference sequences loaded and actually in use at any one time.

The problem is thus:
The alignment records are stored in reference-position order. The sequence, quality, and mate-pairing information is stored in the order in which the mate pairing was completed. Generally, mate pairs are only a "short" distance apart, but outliers are frequent enough (and random enough) to blow out our caches.

- If you don't care about pairing mates (immediately), you can employ an extract-and-sort strategy, like fasterq-dump does.
- If you don't care about preserving the order, you can regroup instead of sort.
- If you don't care about having the mates together, you can skip regrouping.
- If you don't care about quality scores (and you really shouldn't), you can make the above even faster, e.g. by using a data file without quality scores and by not extracting any quality scores (see the cursor sketch below).
- If you need even finer control, you would need to perform the decompression of the aligned sequences yourself, instead of having VDB's transform engine do it. If you are interested, I can give you more details.
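
As a concrete illustration of the quality-score point, here is a minimal, hedged sketch (not code supplied in this issue) of opening a run and creating a cursor that requests only the READ column of the SEQUENCE table, so QUALITY is never decompressed or cached. It assumes the standard calls from `<vdb/manager.h>`, `<vdb/database.h>`, `<vdb/table.h>` and `<vdb/cursor.h>`, and that the accession is an aligned run (a VDB database); error handling is abbreviated.

```c
#include <klib/rc.h>
#include <vdb/manager.h>
#include <vdb/database.h>
#include <vdb/table.h>
#include <vdb/cursor.h>

/* Sketch: open accession as a database, then build a read cursor on its
 * SEQUENCE table with only the READ column added. */
static rc_t open_read_only_cursor(const char *accession,
                                  const VCursor **curs, uint32_t *read_idx)
{
    const VDBManager *mgr = NULL;
    const VDatabase *db = NULL;
    const VTable *tbl = NULL;

    rc_t rc = VDBManagerMakeRead(&mgr, NULL);
    if (rc == 0) rc = VDBManagerOpenDBRead(mgr, &db, NULL, "%s", accession);
    if (rc == 0) rc = VDatabaseOpenTableRead(db, &tbl, "SEQUENCE");
    if (rc == 0) rc = VTableCreateCursorRead(tbl, curs);
    if (rc == 0) rc = VCursorAddColumn(*curs, read_idx, "READ");
    /* deliberately no VCursorAddColumn(..., "QUALITY") -- the column is never touched */
    if (rc == 0) rc = VCursorOpen(*curs);

    /* the cursor keeps its own references; drop the intermediate handles */
    VTableRelease(tbl);
    VDatabaseRelease(db);
    VDBManagerRelease(mgr);
    return rc;
}
```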

jgans (Author) commented Mar 28, 2023

Thank you for the advice! From your description, it sounds like the maximum amount of RAM used by VDB should be proportional to the total size of the reference sequences, rather than to the total size of the reads in an SRA record. Since the RAM my own processing uses per SRA record is already proportional to the total amount of read sequence, the RAM that VDB uses for caching reference sequences should become a progressively smaller fraction of total memory usage as I move to cloud compute instances with more RAM.
