Add support for `tokenizers` #1251

Comments
Thanks for the updates and context! I'd be interested in seeing your working implementation for converting out of HF's tokenizer lib.
Sure! This is what I have for `tiktoken_converter.py`:

```python
"""
Helper module for converting tokenizers from the `tokenizers` package to
tiktoken format for use in torchchat.
"""

# Standard
import argparse
import base64
import json

# Third Party
from transformers.convert_slow_tokenizer import bytes_to_unicode
import tokenizers


## Helpers #####################################################################

def unicode_to_bytes():
    """Inversion of the lookup table for byte -> string"""
    return {v: k for k, v in bytes_to_unicode().items()}


byte_encoder = bytes_to_unicode()
byte_decoder = unicode_to_bytes()


def token_bytes_to_string(b):
    """
    DIRECTLY COPIED FROM `transformers`
    https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1461
    """
    return "".join([byte_encoder[ord(char)] for char in b.decode("latin-1")])


def token_string_to_bytes(s):
    """Inversion of the conversion done in token_bytes_to_string"""
    return bytes([byte_decoder[byt] for byt in s])


def get_tokenizer_state(source_tokenizer: tokenizers.Tokenizer) -> dict:
    """The underlying tokenizer is buried in the rust structs, so it's not
    immediately accessible in python. This function leverages the string
    serialization to pull out the core of the configuration.
    """
    return json.loads(source_tokenizer.to_str())


def extract_pattern(source_tokenizer_state: dict) -> str:
    """Extract the string splitting regex for the pre-tokenizer"""
    return source_tokenizer_state["pre_tokenizer"]["pretokenizers"][0]["pattern"]["Regex"]


def extract_special_tokens(source_tokenizer_state: dict) -> dict[str, int]:
    """Extract the special tokens that were added to the vocab"""
    return {
        itm["content"]: itm["id"]
        for itm in source_tokenizer_state["added_tokens"]
        if itm["special"]
    }


def convert_to_ranks(vocab_dict: dict[str, int]) -> dict[bytes, int]:
    """Convert from string form to the bytes form that is needed by tiktoken"""
    return {token_string_to_bytes(k): v for k, v in vocab_dict.items()}


def convert_tokenizers_to_tiktoken(
    source_tokenizer: tokenizers.Tokenizer,
) -> dict[bytes, int]:
    """End-to-end converter from tokenizers to tiktoken"""
    # Parse the serialized state of the source tokenizer
    source_tokenizer_state = get_tokenizer_state(source_tokenizer)
    # Extract the vocab from the tokenizer
    vocab = source_tokenizer.get_vocab()
    # Extract the special tokens from the tokenizer state
    special_tokens = extract_special_tokens(source_tokenizer_state)
    print("SPECIAL TOKENS:")
    for special_token, tok_id in sorted(special_tokens.items(), key=lambda x: x[1]):
        print(f'"{special_token}": {tok_id}')
    # Remove the special tokens from the vocab
    cleaned_vocab = {k: v for k, v in vocab.items() if k not in special_tokens}
    # Convert the cleaned vocab to byte form
    cleaned_vocab_ranks = convert_to_ranks(cleaned_vocab)
    return cleaned_vocab_ranks


def save_tiktoken_model(bpe_ranks: dict[bytes, int], output_path: str):
    """Saves a tiktoken model from an existing tokenizer."""
    with open(output_path, "wb") as handle:
        for token, rank in sorted(bpe_ranks.items(), key=lambda x: x[1]):
            handle.write(base64.b64encode(token) + b" " + str(rank).encode() + b"\n")


def validate_conversion(
    source_tokenizer: tokenizers.Tokenizer,
    output_file: str,
    test_strings: list[str] | None,
    test_files: list[str] | None,
):
    """Validate the tokenization between the source and target tokenizers"""
    # Local
    # NOTE: Local import to avoid hard dependency on torchchat
    from tokenizer.tiktoken import Tokenizer

    # Load the output tokenizer model with tiktoken in torchchat
    target_tokenizer = Tokenizer(output_file)

    # Define local comparison function
    def compare_tokenization(test_text: str):
        source_tokens = source_tokenizer.encode(test_text).ids
        target_tokens = target_tokenizer.encode(test_text, bos=False)
        if source_tokens != target_tokens:
            print("----------------------------")
            print("MISMATCH FOUND")
            print(f"Test text: {test_text}")
            print(f"Source tokens: {source_tokens}")
            print(f"Target tokens: {target_tokens}")
            print()
            # DEBUG
            breakpoint()

    # Validate on manual strings
    for test_string in test_strings or []:
        compare_tokenization(test_string)

    # Validate on file content
    for test_file in test_files or []:
        with open(test_file, "r") as handle:
            test_text = handle.read()
        compare_tokenization(test_text)


## Main ########################################################################

def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("input_file", help="The tokenizer json file to convert.")
    parser.add_argument(
        "--output-file", "-o",
        default="tokenizer.model",
        help="The filename for the output tokenizer model",
    )
    parser.add_argument("--test-string", "-ts", nargs="*", help="Strings to validate on")
    parser.add_argument("--test-file", "-tf", nargs="*", help="Files to validate on")
    args = parser.parse_args()

    # Load the tokenizer from the json file
    source_tokenizer = tokenizers.Tokenizer.from_file(args.input_file)
    # Do the conversion
    bpe_ranks = convert_tokenizers_to_tiktoken(source_tokenizer)
    # Save the model
    save_tiktoken_model(bpe_ranks, args.output_file)
    # Validate if requested
    if args.test_string or args.test_file:
        validate_conversion(
            source_tokenizer,
            args.output_file,
            args.test_string,
            args.test_file,
        )


if __name__ == "__main__":
    main()
```

The main gap is around handling the pretokenizer. The other piece that is not yet portable is the addition of special tokens beyond the defaults.
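For reference, a rough example of driving the helpers above outside of the CLI (the file names are placeholders, and building a `tiktoken.Encoding` directly is only a quick sanity check; it assumes a llama-style tokenizer.json whose pre-tokenizer regex lives where `extract_pattern` expects it):

```python
# Rough usage of the converter helpers outside of the CLI. The file names are
# placeholders; constructing a tiktoken.Encoding here is only a sanity check.
import tiktoken
import tokenizers

from tiktoken_converter import (
    convert_tokenizers_to_tiktoken,
    extract_pattern,
    extract_special_tokens,
    get_tokenizer_state,
    save_tiktoken_model,
)

source = tokenizers.Tokenizer.from_file("tokenizer.json")
state = get_tokenizer_state(source)

# Convert the vocab to tiktoken's bytes -> rank form and write the .model file
ranks = convert_tokenizers_to_tiktoken(source)
save_tiktoken_model(ranks, "tokenizer.model")

# Sanity check: build a tiktoken Encoding straight from the extracted pieces
enc = tiktoken.Encoding(
    name="converted",
    pat_str=extract_pattern(state),
    mergeable_ranks=ranks,
    special_tokens=extract_special_tokens(state),
)
print(enc.encode("Hello world"))
```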
Draft PR up: #1261. I've noted some more details on the open investigation questions in the Discussion section of the PR.
@Jack-Khuu I've been digging into the landscape of the C++ code a bit. It looks like, in addition to supporting this on the python side, corresponding support is needed in the C++ tokenizer code. Given this, I think we could go one of two ways. Given the compatibility concerns, my initial preference would be for (2), but I want to kick off the conversation since either one would be a pretty significant change.
…nizer_config.json: We may still need to load the merges themselves.
Not a terribly realistic use case, but this avoids a corner case (that I just might be hitting while tokenizers is stubbed out!).
Thanks for the details and analysis, I'll hop over to the PR to comment.
These were likely needed in the original implementation because of being called with different types for allowed_special, but here they're only ever used with an Encoder type, so the template is unnecessary.
This will allow HFTokenizer to reuse all of the BPE logic with different pre/post tokenization.
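To illustrate the intended structure, here is a minimal Python sketch (not the actual C++ classes; the names are placeholders): a shared BPE core that concrete tokenizers specialize with their own pre/post tokenization.

```python
# Minimal Python sketch (not the actual C++ classes) of sharing a BPE core
# while letting concrete tokenizers supply their own pre/post tokenization.
from abc import ABC, abstractmethod


class BPETokenizerBase(ABC):
    """Owns the rank table and the shared BPE encode path."""

    def __init__(self, ranks: dict[bytes, int]):
        self.ranks = ranks

    @abstractmethod
    def pretokenize(self, text: str) -> list[bytes]:
        """Split raw text into pieces before the BPE merge loop runs."""

    def encode(self, text: str) -> list[int]:
        ids: list[int] = []
        for piece in self.pretokenize(text):
            ids.extend(self._bpe_merge(piece))
        return ids

    def _bpe_merge(self, piece: bytes) -> list[int]:
        # Stand-in for the shared merge loop: the real logic repeatedly merges
        # the lowest-rank adjacent pair until no further merge applies.
        return [self.ranks[piece]] if piece in self.ranks else []


class HFTokenizer(BPETokenizerBase):
    """Specializes pre-tokenization based on a tokenizer.json configuration."""

    def pretokenize(self, text: str) -> list[bytes]:
        # A real implementation would apply the configured pre-tokenizer chain;
        # whitespace splitting is only a placeholder here.
        return [w.encode("utf-8") for w in text.split()]
```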
…ic BPE: Committing this now to share it, but will likely rebase as I get back to this once I've handled the pre-tokenizers better.
…ma.cpp: This is a much more efficient way to get this functionality working than a raw port. The original code carries MIT licensing, so the license is kept with a reference at the top of each file. This does introduce a bit of redundancy in the regex support, since the llama.cpp code relies on the STL rather than RE2. This seems OK since it does not introduce an additional dependency, but a future optimization could be to refactor the llama.cpp code to leverage the (faster) RE2 implementation. The tradeoff would be a change in which regexes are supported.
…rship: Some pretokenizers mutate the data as it is split (in particular, the byte-level one), so the returned set of pieces must have ownership over their data. This could potentially be a cost hit since those that do not require ownership will be making copies.
This wraps around the llama.cpp regex splitter. I first attempted a full port of the Rust code, but this accomplishes the same goal and is much more efficient than what I would have written.
Best Practices: https://abseil.io/docs/cpp/guides/strings#string_view
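For anyone following along, here is a rough Python illustration of what the byte-level pre-tokenization step does (regex split, then remap each piece's bytes to printable unicode). The regex below is the GPT-2 pattern and is only an example; the actual pattern comes from the model's tokenizer.json.

```python
# Rough illustration of ByteLevel-style pre-tokenization: split with a
# GPT-2-style regex, then remap each piece's bytes through bytes_to_unicode.
# The pattern here is the GPT-2 one; real models define their own in tokenizer.json.
import regex  # third-party `regex` package (supports \p{...} classes)

from transformers.convert_slow_tokenizer import bytes_to_unicode

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
BYTE_ENCODER = bytes_to_unicode()


def byte_level_pretokenize(text: str) -> list[str]:
    pieces = regex.findall(GPT2_PATTERN, text)
    # Each piece's bytes get remapped to printable characters, which is why the
    # returned pieces must own their (mutated) data rather than view the input.
    return ["".join(BYTE_ENCODER[b] for b in piece.encode("utf-8")) for piece in pieces]


print(byte_level_pretokenize("Hello world!"))  # ['Hello', 'Ġworld', '!']
```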
This factory should be the primary mechanism for instantiating tokenizers from config JSON files.
We still need the byte-level decoder support, but this gets the encoding right in simple tests.
This only supports ByteLevel at this point, so it will need to be expanded to support additional types if/when other models need them.
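On the decode side, the byte-level step is just the inverse mapping. A minimal Python sketch (using the same bytes_to_unicode table as the converter script earlier in this thread, not the C++ code):

```python
# Minimal sketch of byte-level decoding: invert the bytes_to_unicode table so
# the printable-unicode token strings map back to raw UTF-8 bytes.
from transformers.convert_slow_tokenizer import bytes_to_unicode

BYTE_DECODER = {v: k for k, v in bytes_to_unicode().items()}


def byte_level_decode(token_strings: list[str]) -> str:
    raw = bytes(BYTE_DECODER[ch] for token in token_strings for ch in token)
    return raw.decode("utf-8", errors="replace")


print(byte_level_decode(["Hello", "Ġworld", "!"]))  # Hello world!
```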
🚀 The feature, motivation and pitch

The request is to extend the tokenizer module in torchchat to support tokenizers that use the Huggingface `tokenizers` library. There are many models out there that use `tokenizers` which won't be able to run in torchchat until they can be loaded and run either via the `tokenizers` library directly or via a conversion to `tiktoken` or `sentencepiece`.

Alternatives

It may be possible to convert a `tokenizers` tokenizer to a `tiktoken` tokenizer. I have a working implementation of this for the llama tokenizer.json model; however, other models that use different `tokenizers` configurations do not work (in particular Granite Code).

Additional context

This issue is a piece of the puzzle for adding support for Granite Code 3b/8b, which use the llama architecture in transformers but take advantage of several pieces of the architecture that are not currently supported by torchchat. The work-in-progress for Granite Code can be found on my fork: https://github.com/gabe-l-hart/torchchat/tree/GraniteCodeSupport

I have a less fully-fleshed working version of this that I plan to put up as a Draft PR for discussion. I am not intimately familiar with the algorithmic differences between `tiktoken` and the various `tokenizers` pieces (in particular the pretokenizers). My branch has a python implementation that simply wraps `tokenizers`, but I have not yet tried to export Granite Code to other formats, where I suspect it would break without a corresponding C++ implementation. I plan to investigate this further soon!

RFC (Optional)

No response
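Regarding the "python implementation that simply wraps tokenizers" mentioned above, a minimal sketch of what that can look like (the class name and method signatures here are illustrative, not torchchat's actual tokenizer interface):

```python
# Minimal sketch of a python-side wrapper around the HF `tokenizers` library.
# The class name and method signatures are illustrative, not the torchchat API.
from typing import List

from tokenizers import Tokenizer as HFTokenizer


class TokenizersWrapper:
    """Wraps a tokenizer.json model behind a simple encode/decode interface."""

    def __init__(self, tokenizer_path: str):
        self._tok = HFTokenizer.from_file(tokenizer_path)

    def encode(self, text: str, bos: bool = False) -> List[int]:
        ids = self._tok.encode(text).ids
        # BOS handling is model-specific; this assumes an "<s>"-style token exists.
        if bos:
            bos_id = self._tok.token_to_id("<s>")
            if bos_id is not None:
                ids = [bos_id] + ids
        return ids

    def decode(self, ids: List[int]) -> str:
        return self._tok.decode(ids)
```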