minimizer/kmer string compression #107
Comments
Hi, ATCG is represented using only 2 bits per base (00 is 0, 01 is 1, 10 is 2 and 11 is 3).
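To illustrate the 2-bit representation mentioned above, here is a minimal sketch of packing a k-mer into a single 64-bit word. It is not FastANI's actual code: the helper names `encode_base`, `pack_kmer` and `unpack_kmer` are hypothetical, and the base-to-code mapping (A=0, C=1, G=2, T=3) is an assumption, since the comment does not say which base gets which code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: map a nucleotide to a 2-bit code.
// Assumed mapping: A=0 (00), C=1 (01), G=2 (10), T=3 (11).
// Any other character (e.g. N) returns 4 to signal "not encodable".
inline std::uint8_t encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return 4;
    }
}

// Pack a k-mer of at most 32 bases into one 64-bit word, 2 bits per base.
// The first base ends up in the most significant used bits.
// Returns false if the k-mer is longer than 32 or contains a non-ACGT character.
inline bool pack_kmer(const std::string& kmer, std::uint64_t& out) {
    if (kmer.size() > 32) return false;
    std::uint64_t packed = 0;
    for (char c : kmer) {
        const std::uint8_t code = encode_base(c);
        if (code > 3) return false;
        packed = (packed << 2) | code;
    }
    out = packed;
    return true;
}

// Inverse of pack_kmer: recover the k-base string from the packed word.
inline std::string unpack_kmer(std::uint64_t packed, std::size_t k) {
    assert(k <= 32);
    std::string kmer(k, 'A');
    for (std::size_t i = 0; i < k; ++i) {
        kmer[k - 1 - i] = "ACGT"[packed & 3u];  // lowest 2 bits hold the last base
        packed >>= 2;
    }
    return kmer;
}
```

With this layout, a 32-base k-mer fits in 8 bytes instead of the 32 bytes needed when each base is stored as one `char`. Whether that translates into a real saving depends on how much of a tool's memory is taken up by the k-mers themselves rather than by the index and its bookkeeping.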
Thanks Chirag, I also noticed this in kc-c1.c. Why does all-versus-all consume so much memory? Is there any way to reduce it, given that the DNA string is already compressed? Thanks, Jianshu
Or do we need to implement compression for fastANI? Thanks. Jianshu
Hello Chirag, If there is no need to do string compression for fastANI, I will close this issue. Thanks, Jianshu
Sorry Jianshu, I am not clear on what string compression means in this context. FastANI maintains a k-mer database extracted from all genomes, which is subsequently queried during the mapping stage.
Hello Chirag,
Does fastANI compress k-mer/minimizer strings by default? I did not see it after checking. I noticed that the k-mer counter in Heng Li's repo (based on kseq.h) (https://github.com/lh3/kmer-cnt/blob/master/kc-c1.c) encodes AGCT as 0, 1, 2, 3, etc. We could actually do better and represent each base using only 2 bits of memory (00, 01, 10, 11). Since fastANI consumes a lot of memory when running all versus all, I am wondering whether this could save a lot of memory. There are several Rust libraries that compress k-mers into 2 bits per base and save a lot of memory (https://github.com/jean-pierreBoth/kmerutils/blob/master/src/base/alphabet.rs). I noticed there is also one for C++: https://github.com/dassencio/dna-compression
Thanks,
Jianshu