Uneven reading time within a SRA file #29

jgans · 2020-06-18T18:25:10Z

I am downloading SRA files with prefetch and then reading them in C++ by iterating over ngs::ReadCollection. All calculations are running in the cloud on AWS. I have found a small number of seemingly "pathological" SRA runs that take longer to read as the iteration progresses through the file.

For example, the graph below shows the time required to read successive 0.1% chunks of ERR3212419 (the x-axis is the cumulative number of reads read, as a percentage of the total number of reads in the SRA run). As shown in the graph, the first 16% of reads can be read from disk relatively quickly (approximately 2 seconds per 0.1% chunk). However, the time to read the same number of reads then jumps to approximately 12 seconds, and then jumps again to over 100 seconds. (I stopped after loading 21% of the reads.)
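For anyone who wants to reproduce this kind of per-chunk measurement, here is a minimal sketch of a timing harness. It is not the code used to produce the graph: `time_chunks` and the `read_one` callback are hypothetical names, and `read_one` stands in for whatever reads a single record (for example, advancing an iterator obtained from ngs::ReadCollection).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Time successive fixed-size chunks of a sequential read loop.
// `read_one(i)` is a hypothetical callback that reads the i-th record
// (0-based); it stands in for the real per-read work.
// Returns the wall-clock seconds spent on each chunk, in order.
template <typename ReadOne>
std::vector<double> time_chunks(std::uint64_t total_reads,
                                std::uint64_t chunk_size,
                                ReadOne read_one)
{
    assert(chunk_size > 0);
    using clock = std::chrono::steady_clock;

    std::vector<double> seconds;
    for (std::uint64_t first = 0; first < total_reads; first += chunk_size) {
        // The final chunk may be shorter than chunk_size.
        const std::uint64_t last = std::min(first + chunk_size, total_reads);

        const auto start = clock::now();
        for (std::uint64_t i = first; i < last; ++i) {
            read_one(i);
        }
        const std::chrono::duration<double> elapsed = clock::now() - start;
        seconds.push_back(elapsed.count());
    }
    return seconds;
}
```

With `chunk_size` set to 0.1% of the total read count, the returned vector corresponds to one point per chunk on a graph like the one described above.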

Is there a way to read this SRA record (and records like it) so that the time required to read different parts of the file is even? This is important because I would like to read SRA records (from disk) in parallel, and the uneven time-to-read causes significant load imbalances. In this example, parallel workers reading near the beginning of the file finish much faster than the parallel worker reading near the end of the file.
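To make the load-imbalance concern concrete, here is a sketch of one generic mitigation: instead of giving each worker one large, fixed slice of the run, workers pull small chunks from a shared atomic counter, so a worker that lands on a slow region simply completes fewer chunks. This does not make the per-chunk read time even, and it is not a feature of the NGS API. `for_each_chunk_dynamic` and `process_range` are hypothetical names, and `process_range(first, count)` is assumed to be safe to call concurrently from several threads.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Dynamic chunk scheduling: each worker repeatedly claims the next
// unprocessed chunk until none remain.
// `process_range(first, count)` is a hypothetical callback that reads
// `count` records starting at 0-based offset `first`.
template <typename ProcessRange>
void for_each_chunk_dynamic(std::uint64_t total_reads,
                            std::uint64_t chunk_size,
                            unsigned num_workers,
                            ProcessRange process_range)
{
    assert(chunk_size > 0 && num_workers > 0);

    // Round up so a final partial chunk is included.
    const std::uint64_t num_chunks =
        (total_reads + chunk_size - 1) / chunk_size;
    std::atomic<std::uint64_t> next_chunk{0};

    auto worker = [&]() {
        for (;;) {
            // Atomically claim the next chunk index.
            const std::uint64_t c = next_chunk.fetch_add(1);
            if (c >= num_chunks) {
                break;
            }
            const std::uint64_t first = c * chunk_size;
            const std::uint64_t count =
                std::min(chunk_size, total_reads - first);
            process_range(first, count);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < num_workers; ++w) {
        pool.emplace_back(worker);
    }
    for (auto& t : pool) {
        t.join();
    }
}
```

Smaller chunks balance the load better but add scheduling overhead and more seeks into the file, so the chunk size would need to be tuned for a given run.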

kwrodarmer · 2020-06-18T18:29:22Z

Thank you for the detailed report!

Let us examine it before responding. This will not be instantaneous, but we will get back to you as quickly as we can.

Again, really sincere thanks for such great information.
