Hello,
I have been trying to train a tokenizer following the code you provided, but I keep running into an out-of-memory issue. I'm working with a multi-species dataset that is several tens of GB in size. Even though my system has 700 GB of memory, training the BPE tokenizer consistently fails with an out-of-memory error. Could you please share how you managed to train the BPE vocabulary on such a large multi-species dataset plus the 1KG data? Any advice or insights would be greatly appreciated!
Thank you for your help and time.
We also encountered an OOM issue while training the tokenizer. To work around it, we sampled 10 × 10^6 random subsequences from the whole dataset and trained the tokenizer on those samples instead of the full corpus.
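For anyone hitting the same wall, here is a minimal sketch of that sampling approach, assuming the Hugging Face `tokenizers` library and FASTA-formatted input. The file names, sample count, subsequence length, and vocabulary size are illustrative assumptions, not values from the original training code:

```python
# Sketch: sample random fixed-length subsequences from large FASTA files and
# train a BPE tokenizer on the samples rather than the whole corpus.
import random

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

NUM_SAMPLES = 10_000_000   # 10 x 10^6 subsequences, as in the reply above
SUBSEQ_LEN = 1000          # assumed subsequence length
FASTA_FILES = ["multi_species.fa", "1kg.fa"]  # hypothetical input files


def load_sequences(paths):
    """Read raw sequences from FASTA files, skipping header lines."""
    seqs = []
    for path in paths:
        with open(path) as fh:
            chunks = []
            for line in fh:
                if line.startswith(">"):
                    if chunks:
                        seqs.append("".join(chunks))
                        chunks = []
                else:
                    chunks.append(line.strip().upper())
            if chunks:
                seqs.append("".join(chunks))
    return seqs


def sample_subsequences(seqs, n, length):
    """Yield n random fixed-length subsequences drawn from the loaded sequences."""
    for _ in range(n):
        seq = random.choice(seqs)
        if len(seq) <= length:
            yield seq
        else:
            start = random.randrange(len(seq) - length)
            yield seq[start:start + length]


sequences = load_sequences(FASTA_FILES)

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=4096,  # assumed vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# train_from_iterator streams the sampled subsequences to the trainer
# without writing them to disk first.
tokenizer.train_from_iterator(
    sample_subsequences(sequences, NUM_SAMPLES, SUBSEQ_LEN),
    trainer=trainer,
)
tokenizer.save("bpe_tokenizer.json")
```

With this setup the BPE trainer only ever sees the sampled subsequences, which keeps its pair statistics manageable; reading the raw sequences into memory is usually not the bottleneck on a machine with hundreds of GB of RAM, since the OOM tends to come from the trainer itself.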
Hello!
I'm wondering how you segmented the complete dataset. Did you consider overlaps when dividing it into subsequences?