I am downloading SRA files with prefetch and then reading them in C++ by iterating over ngs::ReadCollection. All calculations are running in the cloud on AWS. I have found a small number of seemingly "pathological" SRA runs that take longer to read as the iteration progresses through the file.
For example, the graph below shows the time required to read sequential 0.1% chunks of ERR3212419 (the x-axis is the cumulative number of reads read, as a percentage of the total number of reads in the SRA run). As shown in the graph, the first 16% of reads can be read from disk relatively quickly (approximately 2 seconds per 0.1% chunk). However, the time to read the same number of reads then jumps to approximately 12 seconds per chunk, and then jumps again to over 100 seconds per chunk. (I stopped after loading 21% of the reads.)
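For reference, a per-chunk timing like the one described above can be set up roughly as follows. This is a minimal sketch, not the exact code behind the graph: `chunkRanges` and `timeChunks` are hypothetical helper names, and the reader is passed in as a callback so the sketch compiles without the NGS library. In practice the callback would iterate something like `run.getReadRange(first, count, ngs::Read::all)` on an `ngs::ReadCollection`; the 1-based row numbering is my understanding of that API and should be checked against the NGS documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Split rows 1..totalReads into numChunks contiguous (first, count) ranges.
// Assumption: read rows are 1-based, so the first range starts at row 1.
std::vector<std::pair<uint64_t, uint64_t>> chunkRanges(uint64_t totalReads,
                                                       uint64_t numChunks)
{
    std::vector<std::pair<uint64_t, uint64_t>> ranges;
    if (totalReads == 0 || numChunks == 0) return ranges;
    numChunks = std::min(numChunks, totalReads);
    const uint64_t base  = totalReads / numChunks;
    const uint64_t extra = totalReads % numChunks; // first `extra` chunks get one more read
    uint64_t first = 1;
    for (uint64_t i = 0; i < numChunks; ++i) {
        const uint64_t count = base + (i < extra ? 1 : 0);
        ranges.emplace_back(first, count);
        first += count;
    }
    return ranges;
}

// Call readChunk(first, count) for every range and record the wall-clock
// seconds each call took. readChunk is supplied by the caller (hypothetical:
// it would wrap the actual ngs::ReadCollection iteration).
std::vector<double> timeChunks(
    const std::vector<std::pair<uint64_t, uint64_t>>& ranges,
    const std::function<void(uint64_t, uint64_t)>& readChunk)
{
    std::vector<double> seconds;
    seconds.reserve(ranges.size());
    for (const auto& r : ranges) {
        const auto start = std::chrono::steady_clock::now();
        readChunk(r.first, r.second);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        seconds.push_back(elapsed.count());
    }
    return seconds;
}
```

Calling `chunkRanges(totalReads, 1000)` gives the 0.1% chunks used for the measurement; each range is read independently, so the recorded times do not include any state carried over from an earlier chunk's iterator.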
Is there a way to read this SRA record (and records like it) so that the time required to read different parts of the file is even? This is important because I would like to read SRA records (from disk) in parallel, and the uneven time-to-read causes significant load imbalance. In this example, parallel workers reading near the beginning of the file finish much faster than the worker reading near the end of the file.
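To make the imbalance concrete, here is a small sketch that sums per-chunk read times for workers that each take one contiguous block of chunks. `workerTotals` is a hypothetical helper, and the input times in the note below are illustrative values loosely based on the approximate figures quoted above, not measurements.

```cpp
#include <cstddef>
#include <vector>

// Given the time (in seconds) to read each chunk, return the total time per
// worker when the chunks are split into `workers` contiguous blocks.
// Chunk i is assigned to worker floor(i * workers / n), which keeps each
// worker's chunks adjacent in the file.
std::vector<double> workerTotals(const std::vector<double>& chunkSeconds,
                                 std::size_t workers)
{
    std::vector<double> totals(workers, 0.0);
    if (workers == 0) return totals;
    const std::size_t n = chunkSeconds.size();
    for (std::size_t i = 0; i < n; ++i) {
        totals[i * workers / n] += chunkSeconds[i];
    }
    return totals;
}
```

With illustrative chunk times `{2, 2, 2, 12, 100, 100}` and three workers, the totals come out as 4, 14, and 200 seconds, so the whole job waits on the last worker.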