
Newbie Q: does the training data (train_ids) have to be consecutive? Can I inject -1 as the integer marker id into train_ids? #374

Open
mw66 opened this issue Sep 18, 2023 · 3 comments

mw66 commented Sep 18, 2023

Hi,

I'm thinking about adding a special END OF TEXT token to my data (to separate different articles), e.g.:

#244

I checked here:

train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))

and I'm wondering if the training data (train_ids) has to be consecutive?

E.g., can I use np.iinfo(np.uint16).max as my special marker token, to avoid conflicts with any future dataset as much as possible?
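
To make the question concrete, here is a rough sketch of what I mean (the article token ids are just made-up placeholders):

import numpy as np

# hypothetical sketch: use the largest uint16 value as an end-of-article marker,
# since it sits well outside GPT-2's ~50k-token vocab
MARKER_ID = np.iinfo(np.uint16).max  # 65535

# pretend each article has already been encoded to token ids
articles = [[15496, 995], [464, 3797, 318, 319]]  # placeholder ids

ids = []
for article in articles:
    ids.extend(article)
    ids.append(MARKER_ID)  # separator between articles

train_ids = np.array(ids, dtype=np.uint16)
train_ids.tofile('train.bin')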

Thanks.

VatsaDev commented Sep 19, 2023

Here's a simple way to put this.

Of course the batches will cross into different parts of your training data, and I've found that a literally random order is better.

If you have a single text file, the train/val data won't really be uniform, especially with the nanoGPT approach, which just takes the first 90% of the data as train and the last 10% as val. For me this made train loss relatively high (~2) and val loss extremely high (~5), because my last 10% wasn't similar to the first 90%.

My solution was to load the txt file data in random chunks, so that train and val were different but still similar. As for an end-of-text token, just go for something like <eot> rather than code; it's easier to place and more readable than random bits of code everywhere. As long as you put the token in the text after every article, the model should pick up on it and use it, and then you can just truncate generations after <eot>.

You can look at my prepare.py
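
That said, here is a rough sketch of the random-chunk idea (not my actual prepare.py; the file name, chunk size, and article separator are just placeholders):

import random

# shuffle fixed-size chunks of the text so train and val come from similar
# material, then do the usual 90/10 split; also drop an <eot> marker after
# every article
CHUNK_SIZE = 10_000  # characters per chunk, arbitrary

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here articles are assumed to be separated by blank lines
articles = text.split('\n\n')
text = '\n<eot>\n'.join(articles) + '\n<eot>\n'

chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
random.shuffle(chunks)

split = int(0.9 * len(chunks))
train_data = ''.join(chunks[:split])
val_data = ''.join(chunks[split:])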

mw66 commented Sep 19, 2023

Thanks. My question is more about what values can be used as special markers (without changing the original input text): e.g., when I prepare the data, can I inject -1 as the integer marker id (and save as np.int16 instead of np.uint16, of course)?

I also asked a related question here:

karpathy/minGPT#123

mw66 changed the title from "Newbie Q: does the training data (train_ids) have to be consecutive?" to "Newbie Q: does the training data (train_ids) have to be consecutive? can I inject -1 as the integer marker id into train_ids?" on Sep 19, 2023
VatsaDev commented Sep 24, 2023

To my understanding, we don't add negative values to the tokenizer; we just extend the vocab, like this:

import tiktoken

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
# pass the chat/end markers as allowed special tokens when encoding
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra tokenizer tokens on top of the already ~50,000-token vocab. You probably could have a negative tokenizer value (a -1 token), but you would have to customize tiktoken for that, and adding negative values to the tokenizer means you now have to account for a larger fixed-size integer range, which I think would make it slower.
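
For reference, if you wanted markers like these to come out as single ids, one way is to build a custom tiktoken.Encoding on top of the GPT-2 one; a rough sketch (the ids above GPT-2's 50257-token vocab are just placeholders):

import tiktoken

# register the extra markers as real special tokens with ids just past
# the existing GPT-2 vocab (50257 tokens, ids 0..50256)
base = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt2_custom",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<endOfText>": 50257,
        "<bot>": 50258,
        "<human>": 50259,
        "<system>": 50260,
    },
)

ids = enc.encode("hello <endOfText>", allowed_special="all")
print(ids)  # the marker encodes to the single id 50257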

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work and slower.
I'm posting the same thing on the minGPT repo.
