Chunk size, kmers or seqs? #12
I like that.
Here is a proposed design: the user sets a maximum memory budget, say 1 GB, and the chunk size is auto-calculated from it. Kmers are stored in the following data structure (kmerDecoder/include/kmerDecoder.hpp, line 34 in df39cfd).
When implementing this design, we will need to attach the sequence name to every kmer, which increases memory usage. Why? Because a sequence's kmers will most likely be split across two chunks. Alternatively, we can use a separate data structure to hold this information without redundancy. This design will significantly change the kmerDecoder API, which means it would need updating everywhere it is used in kProcessor. So, I think we should defer this for now.
The memory-based approach is nice.
If the chunk size is set too small and the sequence file is large, it will require many writes to the temp file. I will rethink an alternative design and post it here after implementing the aa-encoding (#14).
Rethinking the chunk size: should we define it as a number of sequences or a number of kmers?
Chunk size as a number of sequences works when the sequence lengths are relatively small. With genomes, for example, setting the chunk size to 10k sequences would consume a lot of memory per chunk. On the other hand, it works smoothly when processing transcripts, since their average length is short. Chunk size as a number of kmers would work fine in both cases, and we can set it to a fixed multiple of thousands or millions.
@drtamermansour what do you think?