
Out of Memory Issue During BPE Tokenizer Training with Large Multi-Species Dataset #14

Open
luoshengtangxiademao opened this issue Dec 3, 2023 · 3 comments

Comments

@luoshengtangxiademao

Hello,

I have been trying to train a tokenizer following the code you provided, but I am running into an out-of-memory issue. I'm working with a multi-species dataset that is several tens of GB in size. Despite having 700 GB of memory on my system, training the BPE tokenizer consistently ends with an out-of-memory error. Could you please share how you managed to train the BPE vocabulary on such a large multi-species dataset, plus the 1KG data? Any advice or insights would be greatly appreciated!

Thank you for your help and time.

@yurakuratov
Contributor

Hi!

We also encountered an OOM issue while training the tokenizer. To overcome this, we sampled 10 × 10^6 (10 million) random subsequences from the whole dataset and trained the tokenizer on that sample only.
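
For reference, here is a minimal sketch of what that sampling strategy could look like with the Hugging Face `tokenizers` library. The file path, subsequence length, vocabulary size, and special tokens below are placeholders (assumptions), not the values used for the actual model; the key point is that `train_from_iterator` only ever sees the sampled subsequences, so the full multi-GB corpus never has to fit in memory.

```python
import os
import random
from tokenizers import Tokenizer, models, trainers

N_SAMPLES = 10_000_000   # 10 x 10^6 subsequences, as in the comment above
SUBSEQ_LEN = 10_000      # assumed subsequence length in characters (base pairs)
DATA_PATH = "multispecies_genomes.txt"  # placeholder: plain-text sequence data

def sample_subsequences(path, n_samples, subseq_len):
    """Yield random fixed-length chunks by seeking to random byte offsets,
    so the multi-GB file is never loaded into memory at once."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        for _ in range(n_samples):
            start = random.randrange(max(file_size - subseq_len, 1))
            f.seek(start)
            chunk = f.read(subseq_len).decode("ascii", errors="ignore")
            # Keep only nucleotide characters; drop newlines and header fragments.
            chunk = "".join(c for c in chunk if c in "ACGTNacgtn")
            if chunk:
                yield chunk

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed target vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# train_from_iterator consumes the generator lazily, so memory usage is bounded
# by the sampled subsequences rather than the full corpus.
tokenizer.train_from_iterator(
    sample_subsequences(DATA_PATH, N_SAMPLES, SUBSEQ_LEN),
    trainer=trainer,
)
tokenizer.save("bpe_tokenizer.json")
```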

@a-green-hand-jack

Hello!
I'm wondering how you segmented the complete dataset into subsequences. Did you allow overlaps when splitting it?

@yurakuratov
Contributor

We followed BigBird's data pipeline, so yes, sequences could overlap during sampling from genomic data and during subsampling for tokenization.
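
To illustrate the overlap point, a simplified assumption rather than the exact BigBird pipeline: if window start positions are drawn independently at random, two sampled windows from the same sequence can cover overlapping regions.

```python
import random

def sample_windows(sequence: str, n_windows: int, window_len: int) -> list[str]:
    """Draw independent random windows from one sequence; overlaps are allowed
    because each start position is sampled without reference to the others."""
    max_start = max(len(sequence) - window_len, 0)
    return [
        sequence[start:start + window_len]
        for start in (random.randint(0, max_start) for _ in range(n_windows))
    ]

# Example: 5 windows of length 20 from a short toy sequence; some will overlap.
windows = sample_windows("ACGT" * 50, n_windows=5, window_len=20)
```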
