Hi! Thanks so much for sharing this repo. I have been following the GENA_LM project with interest and appreciate that you provide the code for training the tokenizer.
I am particularly interested in how long the tokenizer training took and what hardware environment you used.
I also noticed that you partitioned the complete dataset during tokenizer training. Could you share some details about how the dataset was partitioned and what criteria you used?
We used 10 × 10^6 random subsequences to train the tokenizer. The corpus was about 7 GB of text, training required roughly 500 GB of RAM, and it took about 2 hours.
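For reference, here is a rough sketch of how a BPE tokenizer can be trained on such a corpus with the HuggingFace `tokenizers` library; the file name, vocabulary size, and special tokens below are illustrative assumptions, not necessarily the exact GENA-LM settings:

```python
# Illustrative sketch: training a BPE tokenizer on DNA subsequences with the
# HuggingFace `tokenizers` library. Vocabulary size, special tokens, and the
# input path are assumptions, not the exact GENA-LM configuration.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# One subsequence per line, e.g. ~10^7 random subsequences (~7 GB of text).
files = ["subsequences.txt"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed value for illustration
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# Training holds the corpus and merge statistics in memory, which is why
# ~7 GB of text can require hundreds of GB of RAM.
tokenizer.train(files, trainer)
tokenizer.save("dna_bpe_tokenizer.json")
```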
Not sure what you mean by data partitioning, but we followed the pipeline described in #3 (comment) to generate the data from the genome.
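Purely as a hypothetical illustration (the actual data-generation steps are the ones in the linked comment), drawing random subsequences from a genome FASTA might look something like this; the file names, subsequence length, and counts are assumed:

```python
# Hypothetical illustration only -- the real GENA-LM pipeline is the one linked
# above. This just shows the general idea of sampling random fixed-length
# subsequences from a genome FASTA and writing them one per line.
import random

def read_fasta(path):
    """Return a dict of {contig_name: sequence} from a FASTA file."""
    seqs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.upper())
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def sample_subsequences(seqs, n, length):
    """Yield n random subsequences of a fixed length (both values assumed here)."""
    contigs = [s for s in seqs.values() if len(s) >= length]
    for _ in range(n):
        s = random.choice(contigs)
        start = random.randrange(len(s) - length + 1)
        yield s[start:start + length]

if __name__ == "__main__":
    genome = read_fasta("genome.fa")  # hypothetical path
    with open("subsequences.txt", "w") as out:
        for sub in sample_subsequences(genome, n=1000, length=10_000):
            out.write(sub + "\n")
```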