
Inquiry Regarding Details on Training Tokenizer: Time, Hardware, and Dataset Splitting #17

Open
a-green-hand-jack opened this issue Mar 14, 2024 · 1 comment

Comments

@a-green-hand-jack

Hi! Thanks so much for sharing this repo. I have been following the GENA_LM project with interest and appreciate that you have shared the code for training the tokenizer.

I am particularly interested in how long the tokenizer training took and what hardware environment you used.

Additionally, I noticed that you partitioned the complete dataset during tokenizer training. I am curious about the specifics of this partitioning. Could you provide some details on the criteria you used?

@yurakuratov
Contributor

yurakuratov commented Oct 11, 2024

Hi!

We used 10 × 10^6 random subsequences to train the tokenizer. This amounted to about 7 GB of text data; training required about 500 GB of RAM and took roughly 2 hours.
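
For reference, below is a minimal sketch of training a BPE tokenizer on plain-text DNA subsequences with the HuggingFace `tokenizers` library. The file names, vocabulary size, and special tokens are illustrative assumptions, not the exact GENA_LM settings.

```python
# Minimal sketch: train a byte-pair-encoding tokenizer on DNA text files.
# Paths, vocab_size, and special tokens are placeholders for illustration.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Each input file holds one random genome subsequence per line (plain text).
files = ["subsequences_part_0.txt", "subsequences_part_1.txt"]
tokenizer.train(files, trainer)
tokenizer.save("dna_bpe_tokenizer.json")
```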

I am not sure what you mean by data partitioning, but we followed this pipeline to generate data from the genome: #3 (comment).
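
As a purely illustrative example (not the exact pipeline referenced in #3), random subsequences could be drawn from a genome FASTA file along these lines; the file name, sample count, and window length below are hypothetical placeholders.

```python
# Illustrative sketch only: sample fixed-length random subsequences from a
# genome FASTA with Biopython. Not the actual GENA_LM data pipeline.
import random
from Bio import SeqIO


def sample_subsequences(fasta_path, n_samples=1000, length=10_000):
    records = list(SeqIO.parse(fasta_path, "fasta"))
    samples = []
    for _ in range(n_samples):
        rec = random.choice(records)
        if len(rec.seq) <= length:
            continue  # skip contigs shorter than the sampling window
        start = random.randrange(0, len(rec.seq) - length)
        samples.append(str(rec.seq[start:start + length]))
    return samples


if __name__ == "__main__":
    # Write one subsequence per line, ready for tokenizer training.
    for seq in sample_subsequences("genome.fa", n_samples=100):
        print(seq)
```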
