Lowering (or limiting) memory consumption for fastq-dump #424
SRR6399464 is aligned-to-human data; at the very least, a dumper will need to pull in the human genome to reconstruct the reads. Whether to disk or to RAM, it must be put somewhere. Increasingly our tools put it in RAM and let the OS sort it out. FYI, SRR6399464 is a 10GB sra file with about 151 million reads (of 151 bases each). Generally, we don't recommend the use of fastq-dump for aligned data. It is our oldest tool, by and large written before aligned data sets were being submitted to SRA.
I see, thanks! I've seen datasets taking 9GB+ of memory, but I don't have one running at the moment to report. I've been using this query (https://www.ncbi.nlm.nih.gov/sra/?term=%22METAGENOMIC%22%5Bsource%5D+NOT+amplicon%5BAll+Fields%5D) to select metagenomes from the SRA. Do you have any suggestions on how to figure out when a dataset needs to pull a genome to recover reads from an alignment?
Hmm, but the alternatives (prefetch, fasterq-dump) require temporary disk space, which increases the management I have to do in pipelines to clean it up after it is used (I don't need the data after I process it once). The processing I do is lightweight in memory and CPU, but I can't run it on smaller machines because it either requires too much memory (fastq-dump) or more temporary storage (prefetch, fasterq-dump). Do you have any suggestions for downloading SRA datasets on resource-frugal instances?

Thanks!
It depends on how much work you have to do and how much you want to put into setting it up. Generally the strategies all depend on intelligently batching things and amortizing the costs across batches of similar sets. For example, you would divide the aligned ones from the unaligned ones and start processing the unaligned ones using e.g. fastq-dump. Divide the aligned ones into sets depending on the reference sequences they were aligned to, prefetch the set of references, and sam-dump or fasterq-dump.

And if you REALLY need to process a huge ton of SRA data (some people do!) and you want speed and frugality and are willing to get your hands dirty with some programming, we have API-level ways to access the data faster and more precisely than the command line tools can provide.
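As a rough illustration of that API-level route (not code from this thread), here is a minimal sketch using the NGS C++ API (ngs-sdk) that walks one run's primary alignments in fixed-size ranges, so batches can be handed to separate workers; the accession, batch size, and 1-based range start are assumptions made for the example.

```cpp
// Sketch: iterate one run's primary alignments in fixed-size ranges so that
// each range can be processed by a separate worker.
// Accession, batch size and the 1-based range start are example assumptions.
#include <ngs/ncbi/NGS.hpp>
#include <ngs/ReadCollection.hpp>
#include <ngs/AlignmentIterator.hpp>
#include <ngs/ErrorMsg.hpp>
#include <cstdint>
#include <iostream>

int main ()
{
    try
    {
        ngs::ReadCollection run = ncbi::NGS::openReadCollection ( "SRR6399464" ); // example accession

        uint64_t total = run.getAlignmentCount ( ngs::Alignment::primaryAlignment );
        const uint64_t batch = 1000000; // arbitrary batch size

        for ( uint64_t first = 1; first <= total; first += batch ) // assuming 1-based ids
        {
            ngs::AlignmentIterator it =
                run.getAlignmentRange ( first, batch, ngs::Alignment::primaryAlignment );
            while ( it.nextAlignment () )
            {
                // emit the fragment bases of each alignment in this range
                std::cout << '>' << it.getAlignmentId () << '\n'
                          << it.getAlignedFragmentBases () << '\n';
            }
        }
    }
    catch ( ngs::ErrorMsg & e )
    {
        std::cerr << e.what () << '\n';
        return 1;
    }
    return 0;
}
```

A similar split can be done with ReadCollection::getReadRange for runs that have no alignments.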
Good suggestions, thanks! I like the current simplicity of just receiving an SRA Run ID and processing it, but if that's the only option I can adapt and split into different queues and groups.

Hmm, I do want to process a ton =] Do you have pointers on how to use the API to process data? Are the py-vdb examples viable, or are they too outdated? Should I go for the C++ API?
Are you running pipelines in the cloud or on local hosts?
-Chris
Currently using HPCs (and any resource I can put my hands on =]). I tried moving to the cloud before, but resource consumption with fastq-dump made smaller instances impractical.
The python API is maintained, but I don't know about the examples. We use the python bindings for making tests, so we do keep them up-to-date and add stuff as needed. We know someone who is using our APIs to process pretty much the entirety of SRA: ncbi/ngs#27 (comment). Although he is more interested in flat-out speed, he may have some useful experience to share with you.
Nice! I started a new recipe (complementing ncbi-vdb) to expose the Python API:
Oh, cool! I will take a look.
If you choose to go the python route, you might want to get familiar with our […]. One caveat about […].
Those are great tips, thanks! I used L10-fastq.py as a start for a script that reads and outputs sequences (it doesn't even check the read name, because I don't need it) here, and I got the same results for […].
I've seen in some issue before (but don't remember which...) that […]. I also checked ncbi/ngs#27 (comment) and it seems I can use […].

Thanks!
I would also recommend checking out ncbi/ncbi-vdb/issues/31, as that post sketches out how to read aligned reads from an SRA file with reduced memory consumption and increased speed. Here is a C++ code snippet that illustrates the approach in more detail. Beware that this code does not extract partially aligned read pairs (i.e. when one read is aligned and the other read is unaligned) and only returns the reads in the order returned by the SRA toolkit API.
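A minimal sketch of that approach, assuming the NGS C++ API (ngs-sdk) and a hard-coded example accession (this is an illustration, not the original snippet): primary alignments are emitted first, then fully unaligned reads, so, per the caveat above, the unaligned mate of a half-aligned pair is skipped and output order is whatever the iterators return.

```cpp
// Sketch: dump the bases of every primary alignment, then the fragments of
// fully unaligned reads. Half-aligned pairs lose their unaligned mate, and
// no particular read order is preserved. The accession is an example.
#include <ngs/ncbi/NGS.hpp>
#include <ngs/ReadCollection.hpp>
#include <ngs/AlignmentIterator.hpp>
#include <ngs/ReadIterator.hpp>
#include <ngs/ErrorMsg.hpp>
#include <iostream>

int main ()
{
    try
    {
        ngs::ReadCollection run = ncbi::NGS::openReadCollection ( "SRR6399464" ); // example accession

        // 1) bases of every primary alignment, in alignment order
        ngs::AlignmentIterator ai = run.getAlignments ( ngs::Alignment::primaryAlignment );
        while ( ai.nextAlignment () )
        {
            std::cout << '>' << ai.getAlignmentId () << '\n'
                      << ai.getAlignedFragmentBases () << '\n';
        }

        // 2) reads where nothing aligned at all, fragment by fragment
        ngs::ReadIterator ri = run.getReads ( ngs::Read::unaligned );
        while ( ri.nextRead () )
        {
            while ( ri.nextFragment () )
            {
                std::cout << '>' << ri.getFragmentId () << '\n'
                          << ri.getFragmentBases () << '\n';
            }
        }
    }
    catch ( ngs::ErrorMsg & e )
    {
        std::cerr << e.what () << '\n';
        return 1;
    }
    return 0;
}
```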
@luizirber, do you still need help?
Hello,
I've been seeing large memory consumption (1467 MB for SRR6399464, for example) when downloading data using the following command:
```
fastq-dump --disable-multithreading --fasta 0 --skip-technical --readids --read-filter pass --dumpbase --split-spot --clip -Z SRA_ID
```
Do you have any tips on how to lower it, or limit it for long-running processes?
For my use case, I don't need paired reads (I need the read content, but the reads don't need to be paired), and they don't need to be in any specific order. I'm also streaming the data to another process, and would like to minimize temporary disk space usage (so I'm not using fasterq-dump on purpose).

Thanks,
Luiz