Commit

Merge branch 'jbarker/data_preprocessing_bug' into 'main'
Fix off-by-one error in document preprocessing

See merge request ADLR/megatron-lm!705
jon-barker committed Aug 5, 2023
2 parents a4ad305 + 788af6f commit 0609f27
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions tools/preprocess_data.py
@@ -95,6 +95,7 @@ def encode(self, json_line):
                     sentence_lens.append(len(sentence_ids))
             if len(doc_ids) > 0 and self.args.append_eod:
                 doc_ids.append(Encoder.tokenizer.eod)
+                sentence_lens[-1] += 1
             ids[key] = doc_ids
             lens[key] = sentence_lens
         return ids, lens, len(json_line)
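
The effect of the change can be illustrated with a small, self-contained sketch. It is not the Megatron-LM code: `ToyTokenizer`, `encode_doc`, and the `count_eod` flag are hypothetical names made up for this example, and the tokenizer simply splits on whitespace. The sketch assumes the intent of the fix is that the per-sentence lengths should account for every token in `doc_ids`, including the appended end-of-document token.

```python
# Illustrative sketch only: ToyTokenizer and encode_doc are hypothetical
# stand-ins, not the Megatron-LM Encoder or its tokenizer.

class ToyTokenizer:
    """Whitespace tokenizer with a made-up end-of-document id."""
    eod = 0

    def __init__(self):
        self.vocab = {}

    def tokenize(self, text):
        # Assign ids starting at 1 so they never collide with eod.
        return [self.vocab.setdefault(w, len(self.vocab) + 1)
                for w in text.split()]


def encode_doc(sentences, tokenizer, append_eod=True, count_eod=True):
    """Mirror the loop from the diff for a single list of sentences."""
    doc_ids, sentence_lens = [], []
    for sentence in sentences:
        sentence_ids = tokenizer.tokenize(sentence)
        if len(sentence_ids) > 0:
            doc_ids.extend(sentence_ids)
            sentence_lens.append(len(sentence_ids))
    if len(doc_ids) > 0 and append_eod:
        doc_ids.append(tokenizer.eod)
        if count_eod:
            # Equivalent of the line added by this commit: count the EOD
            # token as part of the last sentence.
            sentence_lens[-1] += 1
    return doc_ids, sentence_lens


sentences = ["hello world", "one more sentence"]
ids_fixed, lens_fixed = encode_doc(sentences, ToyTokenizer(), count_eod=True)
ids_old, lens_old = encode_doc(sentences, ToyTokenizer(), count_eod=False)

# Tokens in the document that no sentence length accounts for.
gap_fixed = len(ids_fixed) - sum(lens_fixed)
gap_old = len(ids_old) - sum(lens_old)
print(gap_fixed, gap_old)
```

With `count_eod=False` (the behavior before the fix, under the assumptions above) the recorded lengths fall one token short of the document, so any consumer that slices `doc_ids` by `sentence_lens` would miss the end-of-document token.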
