
Reduce maximum memory usage while reading SRA data? #97

Open
jgans opened this issue Mar 28, 2023 · 2 comments
jgans commented Mar 28, 2023

Reading SRA data with the C VDB API (following the fasterq-dump utility's strategy for accessing SRA records) can consume a significant amount of RAM per record. This is an issue when attempting to minimize cloud computing resources (i.e., instance RAM) while processing a large number of SRA records.

The maximum amount of RAM used while reading (as measured with /usr/bin/time -v) depends on the record:
[screenshot: maximum RAM usage measured for several SRA records]

Periodically calling VCursorRelease() and VCursorOpen() to force the VDB interface to deallocate RAM offers a modest reduction in the maximum RAM used (about 25%), but this strategy significantly slows down the rate at which an SRA record is read.

Is it possible/feasible to limit the VDB C API's memory consumption to sub-gigabyte levels, independent of the number of reads? The goal is to read through an SRA record once, as quickly as possible and with as little RAM as possible.
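
For reference, here is a minimal sketch (not the exact code behind the measurements above) of the release/reopen cycle mentioned earlier: read the table in batches and tear the cursor down between batches so VDB can drop its internal caches. It assumes the standard read-cursor calls from `<vdb/table.h>` and `<vdb/cursor.h>`; the batch size is an arbitrary tuning knob and error handling is reduced to early returns.

```c
#include <klib/rc.h>
#include <vdb/table.h>
#include <vdb/cursor.h>

/* Sketch: read rows [first, first + count) of an already-open VTable in
 * batches, recreating the cursor between batches so its caches are freed. */
static rc_t read_in_batches(const VTable *tbl, int64_t first, uint64_t count)
{
    const uint64_t BATCH_ROWS = 1000000;     /* hypothetical batch size */
    int64_t row = first;
    const int64_t end = first + (int64_t)count;

    while (row < end) {
        const VCursor *curs = NULL;
        uint32_t read_idx = 0;

        rc_t rc = VTableCreateCursorRead(tbl, &curs);
        if (rc != 0) return rc;
        rc = VCursorAddColumn(curs, &read_idx, "READ");
        if (rc == 0) rc = VCursorOpen(curs);

        for (uint64_t n = 0; rc == 0 && n < BATCH_ROWS && row < end; ++n, ++row) {
            const void *base = NULL;
            uint32_t elem_bits = 0, boff = 0, row_len = 0;
            rc = VCursorCellDataDirect(curs, row, read_idx,
                                       &elem_bits, &base, &boff, &row_len);
            /* ... consume row_len bases starting at base ... */
        }

        /* release the cursor (and whatever it cached) before the next batch */
        VCursorRelease(curs);
        if (rc != 0) return rc;
    }
    return 0;
}
```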

durbrow (Collaborator) commented Mar 28, 2023

Possible? Of course. Feasible? Not so much.

Most likely the memory is being used for caching the reference sequences. If you use an access pattern like fasterq-dump's, you will have more reference sequences loaded and actually in use at any one time.

The problem is thus:
The alignment records are stored in reference-position order. The sequence, quality, and mate-pairing information is stored in the order in which the mate pairing was completed. Generally, mate pairs are only a "short" distance apart, but outliers are frequent enough (and random enough) to blow out our caches.

- If you don't care about pairing mates (immediately), you can employ an extract-and-sort strategy, like fasterq-dump does.
- If you don't care about preserving the order, you can regroup instead of sort.
- If you don't care about having the mates together, you can skip regrouping.
- If you don't care about quality scores (and you really shouldn't), you can make the above even faster, e.g. by using a data file without quality scores and by not extracting any quality scores (see the cursor sketch below).
- If you need even finer control, you would need to perform the decompression of the aligned sequences yourself, instead of having VDB's transform engine do it. If you are interested, I can give you more details.
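
As a concrete illustration of the quality-score point, here is a minimal, hedged sketch (not code supplied in this issue) of opening a run and creating a cursor that requests only the READ column of the SEQUENCE table, so QUALITY is never decompressed or cached. It assumes the standard calls from `<vdb/manager.h>`, `<vdb/database.h>`, `<vdb/table.h>` and `<vdb/cursor.h>`, and that the accession is an aligned run (a VDB database); error handling is abbreviated.

```c
#include <klib/rc.h>
#include <vdb/manager.h>
#include <vdb/database.h>
#include <vdb/table.h>
#include <vdb/cursor.h>

/* Sketch: open accession as a database, then build a read cursor on its
 * SEQUENCE table with only the READ column added. */
static rc_t open_read_only_cursor(const char *accession,
                                  const VCursor **curs, uint32_t *read_idx)
{
    const VDBManager *mgr = NULL;
    const VDatabase *db = NULL;
    const VTable *tbl = NULL;

    rc_t rc = VDBManagerMakeRead(&mgr, NULL);
    if (rc == 0) rc = VDBManagerOpenDBRead(mgr, &db, NULL, "%s", accession);
    if (rc == 0) rc = VDatabaseOpenTableRead(db, &tbl, "SEQUENCE");
    if (rc == 0) rc = VTableCreateCursorRead(tbl, curs);
    if (rc == 0) rc = VCursorAddColumn(*curs, read_idx, "READ");
    /* deliberately no VCursorAddColumn(..., "QUALITY") -- the column is never touched */
    if (rc == 0) rc = VCursorOpen(*curs);

    /* the cursor keeps its own references; drop the intermediate handles */
    VTableRelease(tbl);
    VDatabaseRelease(db);
    VDBManagerRelease(mgr);
    return rc;
}
```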

jgans (Author) commented Mar 28, 2023

Thank you for the advice! From your description, it sounds like the maximum amount of RAM used by VDB should be proportional to the total size of the reference sequences, rather than to the total size of the reads in an SRA record. Since the RAM my own processing uses per SRA record is already proportional to the total amount of read sequence, the RAM that VDB uses for caching reference sequences should become a progressively smaller fraction of total memory usage as I move to cloud compute instances with more RAM.
