Newbie Q: does the training data (train_ids) have to be consecutive? Can I inject -1 as the integer marker id into train_ids? #374
Comments
Here's a simple way to put this. Of course the batches will cross into different parts of your training data, and I've found that a literally random order works better. If you have a single text file, the train/val data won't really be uniform, especially with the nanoGPT approach of just taking the first 90% of the data as train and the last 10% as val. For me this made train loss relatively high (~2) and val loss extremely high (~5), because my last 10% wasn't similar to the first 90%. My solution was to load the txt file data in randomly ordered chunks, so the splits were different but still similar. As for an end-of-text token, you can look at my prepare.py for an example.
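A minimal sketch of that chunked shuffle before the 90/10 split; the file name and chunk size are assumptions, not taken from the actual prepare.py:

# shuffle the text in fixed-size chunks before splitting,
# so train and val are drawn from similar parts of the file
import random

chunk_size = 10_000  # characters per chunk; hypothetical value

with open("input.txt", "r", encoding="utf-8") as f:
    data = f.read()

chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
random.shuffle(chunks)
data = "".join(chunks)

n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]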
Thanks. My question is more about what values can be used as special markers (without changing the original input text), e.g. when I prepare data, can I inject -1? I also asked a related question here:
To my understanding, we don't add negative values to the tokenizer, we just extend the vocab, like this:

import tiktoken

# gpt-2 encodings
print("loading GPT-2 encodings...")
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<endOfText>", "<bot>", "<human>", "<system>"})
decode = lambda l: enc.decode(l)

This just adds 4 extra tokens on top of the already ~50,000-token vocab.

tl;dr: it's possible, but people don't really need negative tokens; it's just extra work / slower.
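For reference, if you want each marker to come out as a single token id of its own (rather than being split into regular BPE pieces), tiktoken's documented extension mechanism lets you build an Encoding with extra special tokens. A sketch, where the marker names and the ids 50257-50260 are assumptions chosen to sit just after GPT-2's <|endoftext|> (id 50256):

import tiktoken

base = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, ids 0..50256

# register the markers as real special tokens with their own ids
enc = tiktoken.Encoding(
    name="gpt2_with_markers",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<endOfText>": 50257,
        "<bot>": 50258,
        "<human>": 50259,
        "<system>": 50260,
    },
)

ids = enc.encode("hello<bot>world", allowed_special="all")
print(ids)  # "<bot>" now encodes to the single id 50258

The model's embedding table then has to cover those ids; nanoGPT's default GPT-2 setup pads vocab_size up to 50304 for efficiency, so ids up to 50303 would still fit without touching the config.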
Hi,
I'm thinking about adding a special END OF TEXT token to my data (to separate different articles), e.g.:
#244
I checked here: nanoGPT/data/shakespeare_char/prepare.py (line 51 in eba36e8)
and I'm wondering whether the training data (train_ids) has to be consecutive. E.g., can I use np.iinfo(np.uint16).max as my special marker token, to avoid conflicts with any future dataset as much as possible? Thanks.
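A minimal sketch of what that injection could look like when building the .bin file, under the assumption that the marker only has to be a valid row index into the model's token embedding (so the next unused id after the existing vocab is cheaper than np.iinfo(np.uint16).max, which would force a 65,536-entry embedding table); the articles list and the GPT-2 vocab size here are illustrative assumptions:

import numpy as np

# hypothetical: two already-tokenized articles (e.g. GPT-2 BPE ids)
articles = [[31373, 995], [1212, 318, 257, 1332]]

vocab_size = 50257   # GPT-2 BPE vocab size
sep_id = vocab_size  # assumption: next unused id becomes the article separator

ids = []
for article in articles:
    ids.extend(article)
    ids.append(sep_id)  # the marker doesn't need to be "consecutive" with the text ids;
                        # it just has to stay below the model's vocab_size

train_ids = np.array(ids, dtype=np.uint16)  # every id must stay < 65536 for uint16
train_ids.tofile("train.bin")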