Tokenizers cpp 1251 #1379

Draft · wants to merge 35 commits into main

Conversation

gabe-l-hart (Contributor)

Dependencies

Description

This PR introduces a collection of new tokenizer capabilities at the C++ level. It is minimally complete at this point, so I'm opening this PR as a point of discussion. The main goal is to add compatibility with the Hugging Face tokenizers library at the C++ level.

Additions

Dependencies

  • Add a third-party dependency on nlohmann/json to support parsing tokenizer.json files from HF tokenizers (see the parsing sketch after this list)
  • Add a third-party dependency on gtest for unit tests
  • Add a vendored copy of the unicode.[h|cpp] and unicode-data.[h|cpp] files from llama.cpp. These implement the full set of transformations to and from the byte-encoded space used by the ByteLevel functionality.
    • The files are under the MIT license, so I copied the license text into a header and left a reference to the commit they were copied from.
    • There is a possible inefficiency when interfacing with these methods since they take const std::string& rather than a string_view (re2::StringPiece), so we need to copy from the view to a string before calling the llama.cpp functions. This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
    • I chose to put these directly into the source rather than add them to third_party since this is a slim subset of the full llama.cpp project.
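
For illustration, a minimal sketch of reading the vocab out of tokenizer.json with nlohmann/json might look like this. The function and variable names are just for the example and are not the actual API in this PR; only the tokenizer.json layout (model.vocab as a token -> id map) is assumed.

// Hypothetical example: read the token -> id vocab from an HF tokenizer.json.
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

#include <nlohmann/json.hpp>

std::unordered_map<std::string, uint64_t> load_vocab(const std::string& path) {
  std::ifstream file(path);
  nlohmann::json parsed = nlohmann::json::parse(file);

  // In tokenizer.json, the BPE vocab lives under model.vocab as token -> id.
  std::unordered_map<std::string, uint64_t> vocab;
  for (const auto& [token, id] : parsed.at("model").at("vocab").items()) {
    vocab[token] = id.get<uint64_t>();
  }
  return vocab;
}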

Code Changes

  • Introduce the concept of a PreTokenizer and TokenDecoder aligned with the corresponding concepts in the Rust codebase
    • These are not 1:1 ports, but rather a logical port to fit with the existing tokenizer class structure
  • Split up Tiktoken into a base implementation of BPE (BPETokenizerBase) with hooks for the specifics of the Tiktoken models
  • Add the HFTokenizer as a second derived BPETokenizerBase implementation which adds a PreTokenizer / TokenDecoder to the input/output logic (a rough sketch of these class relationships follows this list)
  • Added the tokenize_tool.cpp to sanity check tokenizers. This was mostly just for my own sanity, so it may not have much utility in the future, but I figured it may also be a useful development tool, so I left it in. It might make sense to move it somewhere else?
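
To make these relationships concrete, here is a rough sketch of how the pieces could fit together. It follows the description above, but the exact signatures and member names are illustrative rather than copied from the PR.

// Illustrative class shapes only; not the literal headers from this PR.
#include <memory>
#include <string>
#include <vector>

// Splits raw input text into pieces before BPE merging is applied.
class PreTokenizer {
 public:
  virtual ~PreTokenizer() = default;
  // Pieces own their data because some pre-tokenizers (e.g. ByteLevel)
  // rewrite the text while splitting it.
  virtual std::vector<std::string> pre_tokenize(const std::string& input) const = 0;
};

// Maps token strings back to the original text space during decoding.
class TokenDecoder {
 public:
  virtual ~TokenDecoder() = default;
  virtual std::string decode(const std::string& token) const = 0;
};

// Shared BPE machinery with hooks for model-specific behavior.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;
  // ... common encode/decode logic lives here ...
};

// Tiktoken keeps its existing behavior on top of the shared base.
class Tiktoken : public BPETokenizerBase { /* tiktoken-specific hooks */ };

// HFTokenizer layers pre-tokenization and token decoding around the same base.
class HFTokenizer : public BPETokenizerBase {
 private:
  std::unique_ptr<PreTokenizer> pre_tokenizer_;
  std::unique_ptr<TokenDecoder> decoder_;
};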

Build and Test

  • Add the tokenizer/tests dir to hold C++ unit tests
  • Add a common main.cpp for all gtest unit tests (see the sketch after this list)
  • Wire up testing in tokenizer/CMakeLists.txt so that it runs with ctest
    • The test binaries are placed in bin/test and can be executed directly as well
  • Added reasonably complete unit tests for the pre_tokenizer stack, but nothing else (yet!)
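
For reference, the shared gtest main.cpp is just the stock Google Test entry point, something like the following; the actual file may differ slightly.

// Conventional shared entry point for all gtest test binaries.
#include <gtest/gtest.h>

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}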

Testing

Build setup

# Set up the build dir
cd /path/to/torchchat
mkdir -p build/tokenizer
cd build/tokenizer

# Configure cmake to build the unit tests and the tokenize tool
cmake ../../tokenizer -D BUILD_UNIT_TESTS=ON -D BUILD_TOKENIZE_TOOL=ON

# Build it
make -j

Unit testing

# (From build/tokenizer)

# Run all unit tests (currently just the one)
ctest

# Run the test binary directly
./bin/test/pre_tokenizer_test

Spot testing

I also did spot testing with several tokenizer models:

# Test with an HF tokenizer json directly
./bin/tokenize_tool hf_tokenizers $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json This is a test 1234 5 foobar

# Test with an HF model repo that has tokenizer.json and tokenizer_config.json
./bin/tokenize_tool hf_tokenizers $HOME/models/ibm-granite/granite-3b-code-instruct-128k This is a test 1234 5 foobar

# Test with a tiktoken tokenizer
./bin/tokenize_tool tiktoken $HOME/models/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.model This is a test 1234 5 foobar

Still TODO

  • Add unit tests for the decoder suite
  • Add unit tests for the tokenizers themselves
  • Ensure that special tokens are handled correctly with the HFTokenizer (no more reliance on tiktoken hard-coded special tokens)

Future work

  • There are still a lot of PreTokenizers and TokenDecoders that are not yet supported, and the whole PostProcessors suite is not supported at all. A good test case for this would be the tokenizers version of the llama3.1 tokenizer (here)

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
* Use the right tokenizer_file name
* Use the right transformer_params_key based on the file name in
model_params
* Use the updated name to indicate HF tokenizers

Signed-off-by: Gabe Goodhart <[email protected]>
Something isn't quite working with this model yet, but the config should be
accurate at this point.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
It was implicitly being pulled in via lm_eval -> transformers, but it's
better to have it explicit since we use it directly

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…HF tokenizers

This is a much simplified version of the corresponding logic in
transformers. I opted for this so that the full transformers dependency is
not added here.

CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1522

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
This will allow the jinja2 templates for HF tokenizers to be applied
without needing to hard-code the formatter logic. This will likely need to
be duplicated in the embedded code version of chat.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
It was getting pulled in implicitly via flask and lm_eval -> transformers,
but better to have it explicit.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…er classes

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…rings

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…nizer_config.json

We may still need to load the merges themselves

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
Not a terribly realistic use case, but this avoids a corner case (that I
just might be hitting while tokenizers is stubbed out!)

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…ests

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…ic BPE

Committing this now to share it, but will likely rebase as I get back to
this once I've handled the pre tokenizers better.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…ma.cpp

This is a much more efficient way to get this functionality working than a
raw port. The original code carries MIT licensing, so the license is kept
with a reference at the top of each file.

This does introduce a bit of redundancy in the regex support since the
llama.cpp code relies on the STL rather than RE2. This seems ok since it does
not introduce an additional dependency, but a future optimization could be
to refactor the llama.cpp code to leverage the (faster) RE2 implementation.
The tradeoff would be a change in which regexes are supported.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…rship

Some pre-tokenizers mutate the data as it is split (in particular, the byte-
level pre-tokenizer), so the returned set of pieces must own their data.
This could be a performance cost, since pre-tokenizers that do not require
ownership will still be making copies.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This wraps around the llama.cpp regex splitter. I first attempted a full
port of the rust code, but this accomplishes the same goal and is much
more efficient than what I would have written.
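
A wrapper of this sort might look roughly like the following sketch, assuming the vendored unicode.h keeps llama.cpp's unicode_regex_split(const std::string&, const std::vector<std::string>&) signature; the class and member names here are illustrative.

// Illustrative pre-tokenizer wrapping the vendored llama.cpp regex splitter.
#include <string>
#include <vector>

#include "unicode.h"  // vendored from llama.cpp

class RegexPreTokenizer {
 public:
  explicit RegexPreTokenizer(std::string pattern) : pattern_(std::move(pattern)) {}

  // Returns owning strings so callers are unaffected if later stages rewrite
  // the pieces.
  std::vector<std::string> pre_tokenize(const std::string& input) const {
    return unicode_regex_split(input, {pattern_});
  }

 private:
  std::string pattern_;
};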

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This factory should be the primary mechanism for instantiating tokenizers
from config json files
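
As a rough illustration of the intent, the factory might look something like the sketch below; all names here are hypothetical, with stand-in types defined inline so the example is self-contained.

// Hypothetical factory sketch: pick a tokenizer implementation based on the
// model file it is pointed at.
#include <memory>
#include <string>

struct Tokenizer {
  virtual ~Tokenizer() = default;
  virtual void load(const std::string& path) = 0;
};
struct HFTokenizer : Tokenizer { void load(const std::string&) override {} };
struct Tiktoken : Tokenizer { void load(const std::string&) override {} };

std::unique_ptr<Tokenizer> get_tokenizer(const std::string& model_path) {
  // Heuristic: HF tokenizers ship a JSON config; tiktoken models do not.
  const bool is_json = model_path.size() >= 5 &&
      model_path.compare(model_path.size() - 5, 5, ".json") == 0;
  std::unique_ptr<Tokenizer> tok;
  if (is_json) {
    tok = std::make_unique<HFTokenizer>();
  } else {
    tok = std::make_unique<Tiktoken>();
  }
  tok->load(model_path);
  return tok;
}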

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
These were likely needed in the original implementation because they were
called with different types for allowed_special, but here they're only ever
used with an Encoder type, so the templates are unnecessary.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This will allow HFTokenizer to reuse all of the BPE logic with different
pre/post tokenization

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
We still need the byte-level decoder support, but this gets the encoding
right in simple tests.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This only supports ByteLevel at this point, so will need to be expanded to
support additional types if/when other models need them.
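
For illustration, ByteLevel decoding can reduce to mapping each code point of a token back to its original byte via the vendored llama.cpp helpers. This sketch assumes unicode_cpts_from_utf8, unicode_cpt_to_utf8, and unicode_utf8_to_byte keep their upstream signatures, and the class name is illustrative.

// Illustrative ByteLevel decoder built on the vendored llama.cpp helpers.
#include <cstdint>
#include <string>

#include "unicode.h"  // vendored from llama.cpp

class ByteLevelDecoder {
 public:
  // Each code point in a byte-level token stands for one original byte.
  std::string decode(const std::string& token) const {
    std::string out;
    for (uint32_t cpt : unicode_cpts_from_utf8(token)) {
      out += static_cast<char>(unicode_utf8_to_byte(unicode_cpt_to_utf8(cpt)));
    }
    return out;
  }
};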

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1379

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Nov 15, 2024