
Inquiry Regarding Details on Training Tokenizer: Time, Hardware, and Dataset Splitting #17

Open
a-green-hand-jack opened this issue Mar 14, 2024 · 1 comment

Comments

@a-green-hand-jack

Hi! Thanks so much for sharing this repo. I have been following the GENA_LM project with interest and appreciate that you have shared the code for training the tokenizer.

I am particularly interested in how long the tokenizer training took and what hardware environment you used.

Additionally, I noticed that you partitioned the complete dataset during tokenizer training. I am curious about the specifics of this partitioning. Could you provide some details on the criteria you used?

@yurakuratov
Contributor

yurakuratov commented Oct 11, 2024

Hi!

We used 10 × 10^6 random subsequences to train the tokenizer. This amounted to about 7 GB of text data; training required about 500 GB of RAM and took roughly 2 hours.
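
For reference, below is a minimal sketch of training a BPE tokenizer on plain-text DNA subsequences with the HuggingFace `tokenizers` library. The file names, vocabulary size, and special tokens are illustrative assumptions, not the exact GENA_LM settings.

```python
# Minimal sketch: train a byte-pair-encoding tokenizer on DNA text files.
# Paths, vocab_size, and special tokens are placeholders for illustration.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Each input file holds one random genome subsequence per line (plain text).
files = ["subsequences_part_0.txt", "subsequences_part_1.txt"]
tokenizer.train(files, trainer)
tokenizer.save("dna_bpe_tokenizer.json")
```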

I am not sure what you mean by data partitioning, but we followed this pipeline to generate data from the genome: #3 (comment).
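
As a purely illustrative example (not the exact pipeline referenced in #3), random subsequences could be drawn from a genome FASTA file along these lines; the file name, sample count, and window length below are hypothetical placeholders.

```python
# Illustrative sketch only: sample fixed-length random subsequences from a
# genome FASTA with Biopython. Not the actual GENA_LM data pipeline.
import random
from Bio import SeqIO


def sample_subsequences(fasta_path, n_samples=1000, length=10_000):
    records = list(SeqIO.parse(fasta_path, "fasta"))
    samples = []
    for _ in range(n_samples):
        rec = random.choice(records)
        if len(rec.seq) <= length:
            continue  # skip contigs shorter than the sampling window
        start = random.randrange(0, len(rec.seq) - length)
        samples.append(str(rec.seq[start:start + length]))
    return samples


if __name__ == "__main__":
    # Write one subsequence per line, ready for tokenizer training.
    for seq in sample_subsequences("genome.fa", n_samples=100):
        print(seq)
```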
