
Newbie Q: does the training data (train_ids) have to be consecutive? Can I inject -1 as the integer marker id into train_ids? #374

Open
mw66 opened this issue Sep 18, 2023 · 3 comments

mw66 commented Sep 18, 2023

Hi,

I'm thinking about adding a special END OF TEXT token to my data (to separate different articles), e.g.:

#244

I checked here:

train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))

and I'm wondering if the training data (train_ids) has to be consecutive?

E.g., can I use np.iinfo(np.uint16).max as my special marker token, to avoid conflicts with any future dataset as much as possible?
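
To make the question concrete, here is a rough sketch of what I mean (the article token ids are just made-up placeholders):

import numpy as np

# hypothetical sketch: use the largest uint16 value as an end-of-article marker,
# since it sits well outside GPT-2's ~50k-token vocab
MARKER_ID = np.iinfo(np.uint16).max  # 65535

# pretend each article has already been encoded to token ids
articles = [[15496, 995], [464, 3797, 318, 319]]  # placeholder ids

ids = []
for article in articles:
    ids.extend(article)
    ids.append(MARKER_ID)  # separator between articles

train_ids = np.array(ids, dtype=np.uint16)
train_ids.tofile('train.bin')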

Thanks.

VatsaDev commented Sep 19, 2023

Here's a simple way to put this.

Of course the batches will cross into different parts of your training data, and I've found that a literally random order is better.

If you have a single text file, the train/val data won't really be uniform, especially with the nanoGPT approach, which just takes the first 90% of the data as train and the last 10% as val. For me this made train loss relatively high (~2) and val loss extremely high (~5), because my last 10% wasn't similar to the first 90%.

My solution was to load the txt file data in random chunks, so that train and val were different but still similar. As for an end-of-text token, just go for something like <eot> rather than code; it's easier to place and more readable than random bits of code everywhere. As long as you put the token in the text after every article, the model should pick up on it and use it, and then you can just truncate generations after <eot>.

You can look at my prepare.py
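
That said, here is a rough sketch of the random-chunk idea (not my actual prepare.py; the file name, chunk size, and article separator are just placeholders):

import random

# shuffle fixed-size chunks of the text so train and val come from similar
# material, then do the usual 90/10 split; also drop an <eot> marker after
# every article
CHUNK_SIZE = 10_000  # characters per chunk, arbitrary

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here articles are assumed to be separated by blank lines
articles = text.split('\n\n')
text = '\n<eot>\n'.join(articles) + '\n<eot>\n'

chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
random.shuffle(chunks)

split = int(0.9 * len(chunks))
train_data = ''.join(chunks[:split])
val_data = ''.join(chunks[split:])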

mw66 commented Sep 19, 2023

Thanks. My question is more about what values can be used as special markers (without changing the original input text): e.g., when I prepare the data, can I inject -1 as the integer marker id (and save as np.int16 instead of np.uint16, of course)?

I also asked a related question here:

karpathy/minGPT#123

mw66 changed the title from "Newbie Q: does the training data (train_ids) have to be consecutive?" to "Newbie Q: does the training data (train_ids) have to be consecutive? can I inject -1 as the integer marker id into train_ids?" on Sep 19, 2023
VatsaDev commented Sep 24, 2023

To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:

import tiktoken

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
# pass the chat/end markers as allowed special tokens when encoding
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra tokenizer tokens on top of the already ~50,000-token vocab. You probably could have a negative tokenizer value (a -1 token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer range, which I think would make it slower.
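
For reference, if you wanted markers like these to come out as single ids, one way is to build a custom tiktoken.Encoding on top of the GPT-2 one; a rough sketch (the ids above GPT-2's 50257-token vocab are just placeholders):

import tiktoken

# register the extra markers as real special tokens with ids just past
# the existing GPT-2 vocab (50257 tokens, ids 0..50256)
base = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt2_custom",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<endOfText>": 50257,
        "<bot>": 50258,
        "<human>": 50259,
        "<system>": 50260,
    },
)

ids = enc.encode("hello <endOfText>", allowed_special="all")
print(ids)  # the marker encodes to the single id 50257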

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and slower.
I'm posting the same thing on the minGPT repo.
