
Out of Memory Issue During BPE Tokenizer Training with Large Multi-Species Dataset #14

Open
luoshengtangxiademao opened this issue Dec 3, 2023 · 3 comments

Comments

@luoshengtangxiademao

Hello,

I have been trying to train a tokenizer following the code you provided, but I am running into an out-of-memory issue. I'm working with a multi-species dataset that is several tens of GB in size. Despite having 700 GB of memory on my system, training the BPE tokenizer consistently ends with an out-of-memory error. Could you please share how you managed to train the BPE vocabulary on such a large multi-species dataset, plus the 1KG data? Any advice or insights would be greatly appreciated!

Thank you for your help and time.

@yurakuratov
Contributor

Hi!

We also encountered an OOM issue while training the tokenizer. To overcome this, we sampled 10 × 10^6 (10 million) random subsequences from the whole dataset and trained the tokenizer on that sample only.
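
For reference, here is a minimal sketch of what that sampling strategy could look like with the Hugging Face `tokenizers` library. The file path, subsequence length, vocabulary size, and special tokens below are placeholders (assumptions), not the values used for the actual model; the key point is that `train_from_iterator` only ever sees the sampled subsequences, so the full multi-GB corpus never has to fit in memory.

```python
import os
import random
from tokenizers import Tokenizer, models, trainers

N_SAMPLES = 10_000_000   # 10 x 10^6 subsequences, as in the comment above
SUBSEQ_LEN = 10_000      # assumed subsequence length in characters (base pairs)
DATA_PATH = "multispecies_genomes.txt"  # placeholder: plain-text sequence data

def sample_subsequences(path, n_samples, subseq_len):
    """Yield random fixed-length chunks by seeking to random byte offsets,
    so the multi-GB file is never loaded into memory at once."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        for _ in range(n_samples):
            start = random.randrange(max(file_size - subseq_len, 1))
            f.seek(start)
            chunk = f.read(subseq_len).decode("ascii", errors="ignore")
            # Keep only nucleotide characters; drop newlines and header fragments.
            chunk = "".join(c for c in chunk if c in "ACGTNacgtn")
            if chunk:
                yield chunk

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed target vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# train_from_iterator consumes the generator lazily, so memory usage is bounded
# by the sampled subsequences rather than the full corpus.
tokenizer.train_from_iterator(
    sample_subsequences(DATA_PATH, N_SAMPLES, SUBSEQ_LEN),
    trainer=trainer,
)
tokenizer.save("bpe_tokenizer.json")
```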

@a-green-hand-jack

Hello!
I'm wondering how you segmented the complete dataset into subsequences. Did you allow overlaps when splitting it?

@yurakuratov
Contributor

We followed BigBird's data pipeline, so yes, sequences could overlap during sampling from genomic data and during subsampling for tokenization.
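
To illustrate the overlap point, a simplified assumption rather than the exact BigBird pipeline: if window start positions are drawn independently at random, two sampled windows from the same sequence can cover overlapping regions.

```python
import random

def sample_windows(sequence: str, n_windows: int, window_len: int) -> list[str]:
    """Draw independent random windows from one sequence; overlaps are allowed
    because each start position is sampled without reference to the others."""
    max_start = max(len(sequence) - window_len, 0)
    return [
        sequence[start:start + window_len]
        for start in (random.randint(0, max_start) for _ in range(n_windows))
    ]

# Example: 5 windows of length 20 from a short toy sequence; some will overlap.
windows = sample_windows("ACGT" * 50, n_windows=5, window_len=20)
```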
