Tokenizers cpp 1251 #1379

Draft · wants to merge 35 commits into main

Conversation

gabe-l-hart (Contributor)

Dependencies

Description

This PR introduces a collection of new tokenizer capabilities at the C++ level. It is minimally complete at this point, so I'm opening this PR as a point of discussion. The main goal is to add compatibility with the Hugging Face tokenizers library at the C++ level.

Additions

Dependencies

  • Add a third-party dependency on nlohmann/json to support parsing tokenizer.json files from HF tokenizers (see the parsing sketch after this list)
  • Add a third-party dependency on gtest for unit tests
  • Add a vendored copy of the unicode.[h|cpp] and unicode-data.[h|cpp] files from llama.cpp. These implement the full set of transformations to and from the byte-encoded space used by the ByteLevel functionality.
    • The files are under the MIT license, so I copied the license text into a header and left a reference to the commit they were copied from.
    • There is a possible inefficiency when interfacing with these methods since they take const std::string& rather than a string_view (re2::StringPiece), so we need to copy from the view to a string before calling the llama.cpp functions. This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
    • I chose to put these directly into the source rather than add them to third_party since this is a slim subset of the full llama.cpp project.
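
For illustration, a minimal sketch of reading the vocab out of tokenizer.json with nlohmann/json might look like this. The function and variable names are just for the example and are not the actual API in this PR; only the tokenizer.json layout (model.vocab as a token -> id map) is assumed.

// Hypothetical example: read the token -> id vocab from an HF tokenizer.json.
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

#include <nlohmann/json.hpp>

std::unordered_map<std::string, uint64_t> load_vocab(const std::string& path) {
  std::ifstream file(path);
  nlohmann::json parsed = nlohmann::json::parse(file);

  // In tokenizer.json, the BPE vocab lives under model.vocab as token -> id.
  std::unordered_map<std::string, uint64_t> vocab;
  for (const auto& [token, id] : parsed.at("model").at("vocab").items()) {
    vocab[token] = id.get<uint64_t>();
  }
  return vocab;
}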

Code Changes

  • Introduce the concept of a PreTokenizer and TokenDecoder aligned with the corresponding concepts in the Rust codebase
    • These are not 1:1 ports, but rather a logical port to fit with the existing tokenizer class structure
  • Split up Tiktoken into a base implementation of BPE (BPETokenizerBase) with hooks for the specifics of the Tiktoken models
  • Add the HFTokenizer as a second derived BPETokenizerBase implementation which adds a PreTokenizer / TokenDecoder to the input/output logic (a rough sketch of these class relationships follows this list)
  • Added the tokenize_tool.cpp to sanity check tokenizers. This was mostly just for my own sanity, so it may not have much utility in the future, but I figured it may also be a useful development tool, so I left it in. It might make sense to move it somewhere else?
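
To make these relationships concrete, here is a rough sketch of how the pieces could fit together. It follows the description above, but the exact signatures and member names are illustrative rather than copied from the PR.

// Illustrative class shapes only; not the literal headers from this PR.
#include <memory>
#include <string>
#include <vector>

// Splits raw input text into pieces before BPE merging is applied.
class PreTokenizer {
 public:
  virtual ~PreTokenizer() = default;
  // Pieces own their data because some pre-tokenizers (e.g. ByteLevel)
  // rewrite the text while splitting it.
  virtual std::vector<std::string> pre_tokenize(const std::string& input) const = 0;
};

// Maps token strings back to the original text space during decoding.
class TokenDecoder {
 public:
  virtual ~TokenDecoder() = default;
  virtual std::string decode(const std::string& token) const = 0;
};

// Shared BPE machinery with hooks for model-specific behavior.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;
  // ... common encode/decode logic lives here ...
};

// Tiktoken keeps its existing behavior on top of the shared base.
class Tiktoken : public BPETokenizerBase { /* tiktoken-specific hooks */ };

// HFTokenizer layers pre-tokenization and token decoding around the same base.
class HFTokenizer : public BPETokenizerBase {
 private:
  std::unique_ptr<PreTokenizer> pre_tokenizer_;
  std::unique_ptr<TokenDecoder> decoder_;
};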

Build and Test

  • Add the tokenizer/tests dir to hold C++ unit tests
  • Add a common main.cpp for all gtest unit tests (see the sketch after this list)
  • Wire up testing in tokenizer/CMakeLists.txt so that it runs with ctest
    • The test binaries are placed in bin/test and can be executed directly as well
  • Added reasonably complete unit tests for the pre_tokenizer stack, but nothing else (yet!)
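
For reference, the shared gtest main.cpp is just the stock Google Test entry point, something like the following; the actual file may differ slightly.

// Conventional shared entry point for all gtest test binaries.
#include <gtest/gtest.h>

int main(int argc, char** argv) {
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}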

Testing

Build setup

# Set up the build dir
cd /path/to/torchchat
mkdir -p build/tokenizer
cd build/tokenizer

# Configure cmake to build the unit tests and the tokenize tool
cmake ../../tokenizer -D BUILD_UNIT_TESTS=ON -D BUILD_TOKENIZE_TOOL=ON

# Build it
make -j

Unit testing

# (From build/tokenizer)

# Run all unit tests (currently just the one)
ctest

# Run the test binary directly
./bin/test/pre_tokenizer_test

Spot testing

I also did spot testing with several tokenizer models:

# Test with an HF tokenizer json directly
./bin/tokenize_tool hf_tokenizers $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json This is a test 1234 5 foobar

# Test with an HF model repo that has tokenizer.json and tokenizer_config.json
./bin/tokenize_tool hf_tokenizers $HOME/models/ibm-granite/granite-3b-code-instruct-128k This is a test 1234 5 foobar

# Test with a tiktoken tokenizer
./bin/tokenize_tool tiktoken $HOME/models/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.model This is a test 1234 5 foobar

Still TODO

  • Add unit tests for the decoder suite
  • Add unit tests for the tokenizers themselves
  • Ensure that special tokens are handled correctly with the HFTokenizer (no more reliance on tiktoken hard-coded special tokens)

Future work

  • There are still a lot of PreTokenizers and TokenDecoders that are not yet supported, and the whole PostProcessors suite is not supported at all. A good test case for this would be the tokenizers version of the llama3.1 tokenizer (here)

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
* Use the right tokenizer_file name
* Use the right transformer_params_key based on the file name in
model_params
* Use the updated name to indicate HF tokenizers

Signed-off-by: Gabe Goodhart <[email protected]>
Something isn't quite working with this model yet, but the config should be
accurate at this point.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
It was implicitly being pulled in via lm_eval -> transformers, but it's
better to have it explicit since we use it directly

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…HF tokenizers

This is a much simplified version of the corresponding logic in
transformers. I opted for this so that the full transformers dependency is
not added here.

CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1522

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
This will allow the jinja2 templates for HF tokenizers to be applied
without needing to hard-code the formatter logic. This will likely need to
be duplicated in the embedded code version of chat.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
It was getting pulled in implicitly via flask and lm_eval -> transformers,
but better to have it explicit.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <[email protected]>
…er classes

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…rings

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…nizer_config.json

We may still need to load the merges themselves

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
Not a terribly realistic use case, but this avoids a corner case (that I
just might be hitting while tokenizers is stubbed out!)

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…ests

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…ic BPE

Committing this now to share it, but will likely rebase as I get back to
this once I've handled the pre tokenizers better.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…ma.cpp

This is a much more efficient way to get this functionality working than a
raw port. The original code carries MIT licensing, so the license is kept
with a reference at the top of each file.

This does introduce a bit of redundancy in the regex support since the
llama.cpp code relies on the STL rather than RE2. This seems ok since it does
not introduce an additional dependency, but a future optimization could be
to refactor the llama.cpp code to leverage the (faster) RE2 implementation.
The tradeoff would be a change in which regexes are supported.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
…rship

Some pre-tokenizers mutate the data as it is split (in particular, the byte-
level pre-tokenizer), so the returned set of pieces must own their data.
This could be a performance cost, since pre-tokenizers that do not require
ownership will still be making copies.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This wraps around the llama.cpp regex splitter. I first attempted a full
port of the rust code, but this accomplishes the same goal and is much
more efficient than what I would have written.
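
A wrapper of this sort might look roughly like the following sketch, assuming the vendored unicode.h keeps llama.cpp's unicode_regex_split(const std::string&, const std::vector<std::string>&) signature; the class and member names here are illustrative.

// Illustrative pre-tokenizer wrapping the vendored llama.cpp regex splitter.
#include <string>
#include <vector>

#include "unicode.h"  // vendored from llama.cpp

class RegexPreTokenizer {
 public:
  explicit RegexPreTokenizer(std::string pattern) : pattern_(std::move(pattern)) {}

  // Returns owning strings so callers are unaffected if later stages rewrite
  // the pieces.
  std::vector<std::string> pre_tokenize(const std::string& input) const {
    return unicode_regex_split(input, {pattern_});
  }

 private:
  std::string pattern_;
};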

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This factory should be the primary mechanism for instantiating tokenizers
from config json files
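
As a rough illustration of the intent, the factory might look something like the sketch below; all names here are hypothetical, with stand-in types defined inline so the example is self-contained.

// Hypothetical factory sketch: pick a tokenizer implementation based on the
// model file it is pointed at.
#include <memory>
#include <string>

struct Tokenizer {
  virtual ~Tokenizer() = default;
  virtual void load(const std::string& path) = 0;
};
struct HFTokenizer : Tokenizer { void load(const std::string&) override {} };
struct Tiktoken : Tokenizer { void load(const std::string&) override {} };

std::unique_ptr<Tokenizer> get_tokenizer(const std::string& model_path) {
  // Heuristic: HF tokenizers ship a JSON config; tiktoken models do not.
  const bool is_json = model_path.size() >= 5 &&
      model_path.compare(model_path.size() - 5, 5, ".json") == 0;
  std::unique_ptr<Tokenizer> tok;
  if (is_json) {
    tok = std::make_unique<HFTokenizer>();
  } else {
    tok = std::make_unique<Tiktoken>();
  }
  tok->load(model_path);
  return tok;
}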

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
These were likely needed in the original implementation because they were
called with different types for allowed_special, but here they're only ever
used with an Encoder type, so the templates are unnecessary.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This will allow HFTokenizer to reuse all of the BPE logic with different
pre/post tokenization

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
We still need the byte-level decoder support, but this gets the encoding
right in simple tests.

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
This only supports ByteLevel at this point, so will need to be expanded to
support additional types if/when other models need them.
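
For illustration, ByteLevel decoding can reduce to mapping each code point of a token back to its original byte via the vendored llama.cpp helpers. This sketch assumes unicode_cpts_from_utf8, unicode_cpt_to_utf8, and unicode_utf8_to_byte keep their upstream signatures, and the class name is illustrative.

// Illustrative ByteLevel decoder built on the vendored llama.cpp helpers.
#include <cstdint>
#include <string>

#include "unicode.h"  // vendored from llama.cpp

class ByteLevelDecoder {
 public:
  // Each code point in a byte-level token stands for one original byte.
  std::string decode(const std::string& token) const {
    std::string out;
    for (uint32_t cpt : unicode_cpts_from_utf8(token)) {
      out += static_cast<char>(unicode_utf8_to_byte(unicode_cpt_to_utf8(cpt)));
    }
    return out;
  }
};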

pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251
Branch: TokenizersCpp-1251

Signed-off-by: Gabe Goodhart <[email protected]>
pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1379

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Nov 15, 2024