Tokenizers cpp 1251 #1379
Draft
gabe-l-hart wants to merge 35 commits into pytorch:main from gabe-l-hart:TokenizersCpp-1251
Conversation
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
* Use the right tokenizer_file name * Use the right transformer_params_key based on the file name in model_params * Use the updated name to indicate HF tokenizers Signed-off-by: Gabe Goodhart <[email protected]>
Something isn't quite working with this model yet, but the config should be accurate at this point. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
It was implicitly being pulled in via lm_eval -> transformers, but it's better to have it explicit since we use it directly Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…HF tokenizers This is a much simplified version of the corresponding logic in transformers. I opted for this so that the full transformers dependency is not added here. CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1522 Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
This will allow the jinja2 templates for HF tokenizers to be applied without needing to hard-code the formatter logic. This will likely need to be duplicated in the embedded code version of chat. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
It was getting pulled in implicitly via flask and lm_eval -> transformers, but better to have it explicit. Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…er classes pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
…rings pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
…nizer_config.json We may still need to load the merges themselves pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
Not a terribly realistic use case, but this avoids a corner case (that I just might be hitting while tokenizers is stubbed out!) pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
…ests pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
…ic BPE Committing this now to share it, but will likely rebase as I get back to this once I've handled the pre tokenizers better. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
…ma.cpp This is a much more efficient way to get this functionality working than a raw port. The original code carries MIT licensing, so the license is kept with a reference at the top of each file. This does introduce a bit of redundancy in the regex support since the llama.cpp code relies on the STL versus RE2. This seems ok since it does not introduce an additional dependency, but a future optimization could be to refactor the llama.cpp code to leverage the (faster) RE2 implementation. The tradeoff would be a change in which regexes are supported. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
…rship Some pretokenizers mutate the data as it is split (in particular, the byte-level), so the returned set of pieces must have ownership over their data. This could potentially be a cost hit since those that do not require ownership will be making copies. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
This wraps around the llama.cpp regex splitter. I first attempted a full port of the rust code, but this accomplishes the same goal and is much more efficient than what I would have written. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
Best Practices: https://abseil.io/docs/cpp/guides/strings#string_view pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
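To make the preceding commits concrete, here is a minimal sketch of a pre-tokenizer that wraps a llama.cpp-style regex splitter, copies out of the string_view before calling the vendored function, and returns owned pieces. The class name, the header paths, and the exact `unicode_regex_split` signature are assumptions for illustration, not necessarily what this PR implements.

```cpp
// Sketch only: wraps a llama.cpp-style regex splitter behind a small
// pre-tokenizer interface. Names and signatures are illustrative.
#include <string>
#include <vector>

#include <re2/stringpiece.h>

// Assumed to roughly match the vendored llama.cpp helper:
//   std::vector<std::string> unicode_regex_split(
//       const std::string& text, const std::vector<std::string>& regex_exprs);
#include "unicode.h"

class RegexPreTokenizer {
 public:
  explicit RegexPreTokenizer(std::string pattern)
      : patterns_{std::move(pattern)} {}

  // Returns owned strings: some pre-tokenizers (e.g. byte-level) mutate the
  // pieces as they split, so the pieces cannot alias the input buffer.
  std::vector<std::string> pre_tokenize(re2::StringPiece input) const {
    // The vendored functions take const std::string&, so copy out of the view.
    const std::string text(input.data(), input.size());
    return unicode_regex_split(text, patterns_);
  }

 private:
  std::vector<std::string> patterns_;
};
```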
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
This factory should be the primary mechanism for instantiating tokenizers from config json files pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
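A minimal sketch of what such a factory could look like, assuming a common Tokenizer base class with a load() method and a dispatch rule keyed on the file name; the function name, headers, and dispatch rule are illustrative, not the exact code in this PR.

```cpp
// Illustrative factory sketch: pick a tokenizer implementation based on the
// artifact on disk. All names here are hypothetical.
#include <filesystem>
#include <memory>
#include <string>

#include "hf_tokenizer.h"  // assumed header for HFTokenizer
#include "tiktoken.h"      // assumed header for Tiktoken; both are assumed to
                           // derive from a Tokenizer base with a load() method

std::unique_ptr<Tokenizer> get_tokenizer(const std::string& path) {
  const std::filesystem::path p(path);
  if (p.filename() == "tokenizer.json") {
    // HF tokenizers config: vocab, merges, pre-tokenizer and decoder settings
    // all live in the single JSON file.
    auto tok = std::make_unique<HFTokenizer>();
    tok->load(path);
    return tok;
  }
  // Otherwise fall back to the existing tiktoken-style artifact.
  auto tok = std::make_unique<Tiktoken>();
  tok->load(path);
  return tok;
}
```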
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
These were likely needed in the original implementation because they were called with different types for allowed_special, but here they're only ever used with an Encoder type, so the template is unnecessary. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
This will allow HFTokenizer to reuse all of the BPE logic with different pre/post tokenization pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
We still need the byte-level decoder support, but this gets the encoding right in simple tests. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
This only supports ByteLevel at this point, so will need to be expanded to support additional types if/when other models need them. pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersCpp-1251 Signed-off-by: Gabe Goodhart <[email protected]>
facebook-github-bot added the CLA Signed label Nov 15, 2024
Dependencies

Description
This PR introduces a collection of new tokenizer capabilities at the C++ level. It is minimally complete at this point, so I'm opening up this PR as a point of discussion. The main goal is to add compatibility with the tokenizers library at the C++ level.

Additions

Dependencies
- Support for tokenizer.json files from HF tokenizers
- unicode.[h|cpp] and unicode-data.[h|cpp] from llama.cpp. These implement the full set of transformations needed to go to/from the byte-encoded space needed by the ByteLevel functionality (the mapping is sketched below).
  - The vendored functions take a const std::string& rather than a string_view (re2::StringPiece), so we need to copy from the view to a string before calling the llama.cpp functions. This could likely be optimized by changing the function signatures in the vendored code, but I tried to avoid any changes to them at this point.
  - The files are copied in directly rather than added as a third_party submodule, since this was a slim subset of the full llama.cpp project.
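For context on that byte-encoded space, here is a minimal sketch of the GPT-2 style byte-to-codepoint mapping that ByteLevel pre-tokenization and decoding are built on; the vendored llama.cpp unicode helpers provide the real implementation, and the function name and table type here are illustrative only.

```cpp
// Illustrative sketch of the ByteLevel byte <-> codepoint mapping used by
// GPT-2 style BPE tokenizers. Printable bytes map to themselves; everything
// else is shifted into unused codepoints starting at 256. Names here are
// hypothetical, not the vendored llama.cpp API.
#include <cstdint>
#include <unordered_map>

std::unordered_map<uint8_t, char32_t> byte_to_unicode() {
  std::unordered_map<uint8_t, char32_t> table;
  int shift = 0;
  for (int b = 0; b < 256; ++b) {
    const bool printable = (b >= '!' && b <= '~') ||
                           (b >= 0xA1 && b <= 0xAC) ||
                           (b >= 0xAE && b <= 0xFF);
    if (printable) {
      table[static_cast<uint8_t>(b)] = static_cast<char32_t>(b);
    } else {
      // Non-printable bytes are remapped to 256, 257, ... so every byte has a
      // visible, reversible codepoint in the tokenizer's vocabulary.
      table[static_cast<uint8_t>(b)] = static_cast<char32_t>(256 + shift++);
    }
  }
  return table;
}
```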
Code Changes
- New PreTokenizer and TokenDecoder class hierarchies, aligned with the corresponding concepts in the Rust codebase
- Refactored Tiktoken into a base implementation of BPE (BPETokenizerBase) with hooks for the specifics of the Tiktoken models
- Added HFTokenizer as a second derived BPETokenizerBase implementation which adds a PreTokenizer/TokenDecoder to the input/output logic (a rough sketch of these relationships follows this list)
- Added tokenize_tool.cpp to sanity check tokenizers. This was mostly just for my own sanity, so it may not have much utility in the future, but I figured it may also be a useful development tool so left it in. It might make sense to move it somewhere else?
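A rough sketch of how these classes could relate, under the assumption of simple virtual hooks; the actual names, signatures, and members in the PR may differ.

```cpp
// Illustrative sketch of the class relationships only.
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

class PreTokenizer {
 public:
  virtual ~PreTokenizer() = default;
  // Split raw text into pieces that are then BPE-merged independently.
  virtual std::vector<std::string> pre_tokenize(const std::string& text) const = 0;
};

class TokenDecoder {
 public:
  virtual ~TokenDecoder() = default;
  // Undo per-piece transformations (e.g. the byte-level encoding) on decode.
  virtual std::string decode(const std::string& piece) const = 0;
};

// Shared BPE machinery, with hooks for the model-specific parts.
class BPETokenizerBase {
 public:
  virtual ~BPETokenizerBase() = default;

  std::vector<uint64_t> encode(const std::string& text) const {
    std::vector<uint64_t> ids;
    for (const std::string& piece : split(text)) {
      (void)piece;  // placeholder: run the shared BPE merge loop per piece
    }
    return ids;
  }

 protected:
  // Hook: Tiktoken uses its hard-coded regex; HFTokenizer delegates to a
  // configured PreTokenizer.
  virtual std::vector<std::string> split(const std::string& text) const = 0;
};

class HFTokenizer : public BPETokenizerBase {
 public:
  std::string decode_piece(const std::string& piece) const {
    return decoder_ ? decoder_->decode(piece) : piece;
  }

 protected:
  std::vector<std::string> split(const std::string& text) const override {
    return pre_tokenizer_ ? pre_tokenizer_->pre_tokenize(text)
                          : std::vector<std::string>{text};
  }

 private:
  std::unique_ptr<PreTokenizer> pre_tokenizer_;  // parsed from tokenizer.json
  std::unique_ptr<TokenDecoder> decoder_;        // e.g. a ByteLevel decoder
};
```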
Build and Test
- Added a tokenizer/tests dir to hold C++ unit tests
- A single main.cpp shared by all gtest unit tests
- Hooked the tests into tokenizer/CMakeLists.txt so that they run with ctest; the test binary builds to bin/test and can be executed directly as well
- Tests currently cover the pre_tokenizer stack, but nothing else (yet!); a sample test is sketched below
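For reference, a minimal example of what a gtest case over the pre_tokenizer stack might look like, assuming a hypothetical RegexPreTokenizer with a pre_tokenize() method; adjust the names and header to the actual API.

```cpp
// Hypothetical unit test sketch; the pre-tokenizer API is assumed.
#include <gtest/gtest.h>

#include <string>
#include <vector>

#include "pre_tokenizer.h"  // assumed header for the pre-tokenizer classes

TEST(PreTokenizerTest, PiecesPartitionTheInput) {
  // GPT-2 style pattern, simplified: runs of non-whitespace or whitespace.
  RegexPreTokenizer pre_tok(R"([^\s]+|\s+)");
  const std::string input = "hello world";
  std::vector<std::string> pieces = pre_tok.pre_tokenize(input);

  EXPECT_EQ(pieces.size(), 3);  // "hello", " ", "world"

  // Regardless of the exact split, the pieces should reconstruct the input.
  std::string joined;
  for (const auto& p : pieces) joined += p;
  EXPECT_EQ(joined, input);
}
```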
Testing

Build setup

Unit testing

Spot testing
I also did spot testing with several tokenizer models:

Still TODO
- Special token handling (e.g. the tiktoken hard-coded special tokens)
Future work
- There are additional PreTokenizers and TokenDecoders that are not yet supported, as well as the whole PostProcessors suite. A good test case for this would be the tokenizers version of the llama3.1 tokenizer (here).