forked from huggingface/text-generation-inference
Upgrade to 2.3.1 #225
Open
yuanwu2017 wants to merge 332 commits into huggingface:habana-main from yuanwu2017:2.3.0
+61,860 −24,122
Conversation
* Fixing gemma2.
* Adding new model.
* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
huggingface#2148)
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]
* Apply suggestions from code review

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
GPTQ-Marlin is currently the best-performing kernel for GPTQ models, so use it by default when the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`: this subcommand has used asymmetric quantization since the beginning, and incorrectly reporting such a model as symmetric would make it use GPTQ-Marlin (which does not support asymmetric quantization).
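The selection rule described in this commit can be sketched as a small predicate. This is an illustrative sketch, not the actual text-generation-inference code; the function name, the capability threshold, and the parameters are assumptions for illustration.

```python
# Hypothetical sketch of the default-kernel rule: prefer GPTQ-Marlin only
# when the kernels are installed, the GPU is new enough, and the
# quantization config is symmetric (Marlin cannot do asymmetric GPTQ).
def choose_gptq_kernel(marlin_installed: bool,
                       capability: tuple,
                       sym: bool) -> str:
    """Return which GPTQ kernel to use for this model."""
    # Marlin kernels target Ampere and newer (compute capability >= 8.0).
    if marlin_installed and capability >= (8, 0) and sym:
        return "gptq-marlin"
    # Otherwise fall back to the generic GPTQ path.
    return "gptq"
```

Under this sketch, an asymmetric (`sym=False`) checkpoint such as one produced by `text-generation-server quantize` always takes the fallback path, which is why reporting it as symmetric would be incorrect.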
…tform (huggingface#2132)
* refine get xpu free memory
* enable qwen2 in xpu
* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <[email protected]>
* fix: prefer enum for chat object
* fix: adjust typo
* fix: enum CompletionType not ObjectType
* fix: adjust typo
* feat: leverage serde for conditional deser
* fix: adjust HubTokenizerConfig after rebase
* fix: update create_post_processor logic for token type
* fix: adjust unwrap syntax in template
* Fixing the post processor.

Co-authored-by: Nicolas Patry <[email protected]>
…1940)
* Using flash decoding: conditional flashdecoding; fix max_q; working kvcache; working version with flash decoding; make it work for mistral; fix after rebase; less intrusive; revert changes in modeling; speedup flashdecoding; hack to make other models work; fixing non flash decoding llama path; router logic knows about page size; missing 2 models; missing cohere; fixing cohere flash decoding; revamped all this architecture; fix cohere; fixing falcon; enabling custom block size schedule; update router/src/infer.rs; not sending preallocated output.
* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
…2161) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <[email protected]>
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
This reverts commit 2bbb7fa.
Adding "longrope" for phi-3
…2166)
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal.lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
* Add more representative Llama GPTQ test

  The Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`.

* Add support for manually triggering a release build
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
specify how to call local adapters
* Add LoRA adapters support for Gemma2 * Make `black` formatting happy
* Fix `cargo build --features google` * Add `cargo test --features google`
* Improve support for GPUs with capability < 8
  - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8.
  - Disable prefix caching when using paged attention.
  - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
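The fallback policy in this commit can be summarized as a small dispatch function. This is a minimal sketch of the policy as stated in the commit message; the function and backend names are illustrative, not TGI's real API.

```python
# Hypothetical sketch: pick an attention implementation from the CUDA
# compute capability, mirroring the policy described in the commit above.
def select_attention_backend(capability: tuple,
                             flashinfer_usable: bool) -> dict:
    """Return the attention backend and whether prefix caching is allowed."""
    if flashinfer_usable and capability >= (8, 0):
        # Ampere or newer with flashinfer: prefix caching stays enabled.
        return {"backend": "flashinfer", "prefix_caching": True}
    # Pre-Ampere GPUs (or models that cannot use flashinfer) fall back to
    # flash-attn v1 + paged attention. v1 cannot use block tables, so the
    # key/value tensors are passed directly and prefix caching is disabled.
    return {"backend": "flash-attn-v1+paged", "prefix_caching": False}
```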
Remove compute capability lock We are only calling the `get_cuda_capability` function once, so avoiding the cost of multiple calls is not really necessary yet.
* style
* update torch
* fix issues
* fix clone
* revert mkl
* added custom PA
* style
* fix style
* style
* hide env var
* fix mixtral model
* add skinny kernel and merge fixes
* fixed style
* fix issue for sliding window models
* addressed review comments
* fix import
* improved error message
* updated default value
* remove import
* fix imports after rebase
* float16 dep
* improve dockerfile
* cleaned dockerfile
…ce#2557)
This change adds support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
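The support matrix above amounts to a configuration check. Below is an illustrative sketch of such a check; the parameter names mirror common GPTQ config fields but are assumptions, not the actual TGI validation code.

```python
# Hypothetical sketch of the GPTQ-MoE support matrix from the commit above.
def gptq_moe_config_supported(desc_act: bool, group_size: int,
                              sym: bool, is_awq: bool,
                              tp_size: int) -> bool:
    """Return True if this quantized MoE configuration is supported."""
    if is_awq:
        return False   # AWQ checkpoints are not supported
    if not sym:
        return False   # asymmetric quantization is not supported
    if desc_act and tp_size > 1 and group_size != -1:
        return False   # desc_act + tensor parallelism only with group_size=-1
    return True
```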
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.

Co-authored-by: Daniël de Kok <[email protected]>
Co-authored-by: Nicolas Patry <[email protected]>
This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.
…e#2470)
* nix: experimental support for building a Docker image

  Run using something like:

  ```
  docker run \
    --device nvidia.com/gpu=all \
    -it --rm -p 8080:80 \
    -v $PWD/data:/data \
    -v $PWD/tmp:/tmp \
    tgi-docker:latest \
    --model-id <model_id>
  ```

* Example of building the Docker image using Nix inside Docker
* Stream to make the builder image smaller

  This avoids storing a Docker image tarball in the image. Instead, stream the layers while doing `docker run`.

* Don't spam journalctl on Linux
* Other dockerfile.

Co-authored-by: Nicolas Patry <[email protected]>
* Working loading state.
* Preprocessing.
* Working state? (Broke idefics1 temporarily.)
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Upgrade transformers 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integration tests for mllama (cutting to 10 tokens because there seems to be instability afterwards, meaning the size of the batch matters).
* Updating model link.
* Earlier assert.
* Fix vlm?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
* adding max_token_capacity_metric
* added tgi to name of metric
* Adding max capacity metric.
* Add description for the metrics

Co-authored-by: Edwinhr716 <[email protected]>
…e#2602) allow revision for lora adapters from launcher Co-authored-by: Sida <[email protected]> Co-authored-by: teamclouday <[email protected]>
* feat: unroll notify_error if no tool is chosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file
* New release 2.3.1
* Update doc number
Signed-off-by: yuanwu <[email protected]>
This was referenced Nov 7, 2024
@yuanwu2017, please test whether this PR introduces any performance regression for llama2, llama3.1, and llava-next.
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.