Tokenizers tokenizer #1261
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1261
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4a20f69 with merge base f20f5e7. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from f2cba4c to 3554c3e
@Jack-Khuu This PR is now the tip of the chain. I've opened it up to review, but I suspect this one will need a lot more discussion than the others. As an FYI, I'm working on a c++ implementation that would support …
Moving conversation on the various open questions here. I think I've just discovered part of why converting from … One of the main differences between the …

UPDATE: Further digging shows this might still be ok for standard cases. For Granite Code at least, the ordering of the tokens in the …
Force-pushed from 3554c3e to c66ac78
Pardon the delay: I've been OOO (still am). Thanks again!!
Not a problem at all, I've been distracted on other threads too. I have some partial work towards a native …
Thanks again @gabe-l-hart, feel free to loop me into the other threads (HF?) if you think it'll help
Out of scope of this PR (i.e. we'll fix afterwards), but we should probably move toward using an enum for the tokenizer to save us some headache
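For illustration, a minimal sketch of what such an enum might look like (the names here are hypothetical, not something defined in this PR):

from enum import Enum, auto

# Hypothetical enum for the supported tokenizer backends; a single
# tokenizer_type field could then replace the mutually exclusive is_* flags.
class TokenizerType(Enum):
    SENTENCEPIECE = auto()
    TIKTOKEN = auto()
    HF_TOKENIZERS = auto()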
@@ -23,6 +23,8 @@
import tiktoken
from tiktoken.load import load_tiktoken_bpe

from .base import TokenizerBase
Any reason not to use the full path?
from .base import TokenizerBase
from tokenizer.base import TokenizerBase
Heh, no, I have tended towards relative imports for local files (the mental equivalent of #include "foo.h" vs #include <string> for local files vs standard/third party). Definitely no strong preference though! I'd much rather stay consistent with the rest of the project.
torchchat/cli/builder.py
Outdated
@@ -193,6 +193,7 @@ class TokenizerArgs:
tokenizer_path: Optional[Union[Path, str]] = None
is_sentencepiece: bool = False
is_tiktoken: bool = False
is_tokenizers: bool = False
Since tokenizers as a general term is overloaded:
is_tokenizers: bool = False
is_hf_tokenizers: bool = False
tokenizer/tokenizers.py
Outdated
from .base import TokenizerBase


class TokenizersTokenizer(TokenizerBase):
class TokenizersTokenizer(TokenizerBase):
class HFTokenizer(TokenizerBase):
Ditto with the file name
Nice, I like that. I was struggling with the generic name and things like TokenizersTokenizer just sound bad. I'll rename the file in a separate commit since I can't stage that as a suggestion.
torchchat/cli/builder.py
Outdated
@@ -229,16 +244,27 @@ def validate_model(
if model is None:
    return

if self.is_tiktoken == self.is_sentencepiece:
if len(list(filter(lambda x: x, [self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]))) != 1:
if len(list(filter(lambda x: x, [self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]))) != 1:
if sum([self.is_tiktoken, self.is_tokenizers, self.is_sentencepiece]) != 1:
Nice, that's way simpler!
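The simplification works because Python bools are ints, so summing the flags counts how many are set; a tiny illustration (not code from this PR):

flags = [False, True, False]  # e.g. is_tiktoken, is_hf_tokenizers, is_sentencepiece
assert sum(flags) == 1, "exactly one tokenizer type must be selected"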
torchchat/model.py
Outdated
use_tiktoken: bool = False
use_tokenizers: bool = False
use_tokenizers: bool = False
use_hf_tokenizers: bool = False
torchchat/model.py
Outdated
@@ -329,12 +331,14 @@ class ModelArgs:
model_type: ModelType
transformer_args: Dict[str, Dict[str, Any]]
use_tiktoken: bool
use_tokenizers: bool
use_tokenizers: bool
use_hf_tokenizers: bool
Force-pushed from c66ac78 to 87bcf5c
Mini update: we have some engineers internally who may be interested in helping on the C++ front if you get stuck, btw.
That's great! I'm finally getting back to this. Will push updates for your suggestions and will push a branch with the very WIP c++ stuff.
Just pushed changes with all the renames. Thanks for the suggestion!
Force-pushed from ca7f7ee to 5f332e7
Oops, missed one place to change the name. Should be fixed now.
Very WIP on the …
…support Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…tokenizers This allows for all HF tokenizers to be supported in the python layer. It will need significant work to offer similar compatibility at the c++ layer. Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
…kenizer Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteCodeSupport Signed-off-by: Gabe Goodhart <[email protected]>
pytorch#1251 Branch: TokenizersTokenizer-1251 Co-Authored-By: [email protected] Signed-off-by: Gabe Goodhart <[email protected]>
Force-pushed from 5f332e7 to 4a20f69
Once all the changes needed to support Granite have landed, be sure to add the models to the known model JSON (added: just saw #1336, which does that) and the README.md model list, please? Also, at the point when the Granite models work with the code that's checked in, is there a smallish Granite model (ideally without a special license that needs to be accepted, to avoid having to deal with HF tokens as GitHub secrets) that could be run as an end-to-end test?
All Granite models (starting with the Granite Code ones) are under Apache-2. The smallest Granite Code model is the 3b one, which is admittedly not CI/CD sized. Once we start tackling the …
For the discussion around …
Dependencies
This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
Issues
Closes #1251
Description
This PR adds partial support for models that use the tokenizers library (as opposed to tiktoken or sentencepiece) for tokenization. This PR only addresses support in the python runner, and it does so by creating a new class in the tokenizer module that simply wraps tokenizers.
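To make the shape of that wrapper concrete, here is a minimal sketch of the approach (the method names and the TokenizerBase interface are assumptions based on the existing tiktoken/sentencepiece wrappers, not the exact code in this PR):

from typing import List, Optional

from tokenizers import Tokenizer  # the HF `tokenizers` library


class HFTokenizerSketch:
    """Illustrative wrapper around an HF tokenizer.json file."""

    def __init__(self, tokenizer_path: str):
        self._tok = Tokenizer.from_file(tokenizer_path)

    def encode(self, text: str, bos: bool = False, eos: bool = False) -> List[int]:
        ids = self._tok.encode(text, add_special_tokens=False).ids
        if bos and (bos_id := self.bos_id()) is not None:
            ids = [bos_id] + ids
        if eos and (eos_id := self.eos_id()) is not None:
            ids = ids + [eos_id]
        return ids

    def decode(self, ids: List[int]) -> str:
        return self._tok.decode(ids, skip_special_tokens=False)

    def bos_id(self) -> Optional[int]:
        # Token names vary by model; "<s>"/"</s>" are placeholders for illustration
        return self._tok.token_to_id("<s>")

    def eos_id(self) -> Optional[int]:
        return self._tok.token_to_id("</s>")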
Discussion
I'm not sure this is the correct direction to go for solving this since the tokenizers library is not (to the best of my knowledge) portable to the various export formats (yet). There are two main challenges to extending more tokenizer support outside of simply wrapping tokenizers:

Pre-tokenizers
For many tokenizers, multiple regexes are used in sequence to split the raw string. Not being a regex expert myself, it's not immediately clear to me if it's possible to merge this kind of multi-pass splitting into a single regex. For other tokenizers, a single regex is used, but it is a different expression than any of those currently implemented in tiktoken.
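To illustrate what multi-pass splitting means in practice, here is a toy sketch of applying two regexes in sequence (the patterns are made up for illustration, not the ones any real tokenizer uses):

import re

# Stage 1: coarse split into whitespace / non-whitespace runs
STAGE_ONE = re.compile(r"\s+|\S+")
# Stage 2: within each piece, split digits into single-character tokens
STAGE_TWO = re.compile(r"\d|\D+")

def pre_tokenize(text: str) -> list[str]:
    pieces = STAGE_ONE.findall(text)
    out: list[str] = []
    for piece in pieces:
        out.extend(STAGE_TWO.findall(piece))
    return out

print(pre_tokenize("abc 123 def"))  # ['abc', ' ', '1', '2', '3', ' ', 'def']

Collapsing the two stages into one equivalent regex is exactly the part that is not obviously possible in general.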
From my investigation, I think there are a few candidate paths forward:

- A c++ implementation of the various tokenization routines from tokenizers in a separate implementation of the Tokenizer class.
- Extending the c++ TikToken class to support multiple regexes in the pre-tokenizer. This would require some way of encoding the pre-tokenizer in the tokenizer.model artifact, or somehow making these tokenizer arguments an argument at instantiation time.

NOTE: The corresponding tokenization in llama.cpp lives here. This code is a full implementation of a unified tokenizer with configuration to dispatch between known patterns and optimized implementations. The config for the model that indicates which tokenizer to use is stored in the model's GGUF file directly, so at load time, the correct tokenizer is found based on that value.
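As a rough sketch of that dispatch-by-config pattern (the registry keys, loader names, and metadata key below are hypothetical, not llama.cpp's or torchchat's actual API):

from typing import Callable, Dict

def _load_sentencepiece(path: str): ...
def _load_tiktoken(path: str): ...
def _load_hf_tokenizers(path: str): ...

# Hypothetical registry keyed on a tokenizer identifier stored alongside the model
TOKENIZER_LOADERS: Dict[str, Callable[[str], object]] = {
    "sentencepiece": _load_sentencepiece,
    "tiktoken": _load_tiktoken,
    "hf_tokenizers": _load_hf_tokenizers,
}

def load_tokenizer(metadata: Dict[str, str], tokenizer_path: str) -> object:
    # Pick the implementation based on the value recorded with the model
    kind = metadata.get("tokenizer.type", "sentencepiece")
    return TOKENIZER_LOADERS[kind](tokenizer_path)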
Special Tokens
Even for models that use a single regex (and even the llama regex), models may use different special tokens for special functionality (chat template, FIM, tool calling, other custom prompting). In the tokenizer.model, only the vocab is stored, so there is not currently any way to note the special tokens in serialization (similar to the need for configuration of pre-tokenizers).
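For contrast, here is how the HF tokenizers library handles this at runtime: special tokens are registered on the tokenizer itself (and serialized in tokenizer.json), so they survive round trips without being split by the pre-tokenizer. The file path and token names below are placeholders, not taken from any specific model in this PR:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # placeholder path

# Register FIM-style special tokens; in the HF format these live in tokenizer.json
tok.add_special_tokens(["<fim_prefix>", "<fim_middle>", "<fim_suffix>"])

# The markers encode to single ids rather than being split by the pre-tokenizer
ids = tok.encode("<fim_prefix>def hello():<fim_suffix>").ids
print(ids)

A bare vocab-only tokenizer.model has no equivalent place to record that information.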