Upgrade to 2.3.1 #225

Open
wants to merge 332 commits into base: habana-main

332 commits
bc15e96
Fixing gemma2. (#2135)
Narsil Jun 27, 2024
6951486
fix: refactor post_processor logic and add test (#2137)
drbh Jun 27, 2024
8721b60
fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_…
sywangyi Jul 1, 2024
03691f6
Fixing clippy. (#2149)
Narsil Jul 1, 2024
3e02d4f
fix: use weights from base_layer (#2141)
drbh Jul 1, 2024
de96056
feat: download lora adapter weights from launcher (#2140)
drbh Jul 1, 2024
e0d168b
Use GPTQ-Marlin for supported GPTQ configurations (#2111)
danieldk Jul 1, 2024
5b977c3
fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
icyxp Jul 1, 2024
6265956
refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel pla…
sywangyi Jul 1, 2024
381c5c0
fix: prefer serde structs over custom functions (#2127)
drbh Jul 1, 2024
2b9339c
Fixing baichuan override. (#2158)
Narsil Jul 1, 2024
b80bd72
Move to FlashDecoding instead of PagedAttention kernel. (#1940)
Narsil Jul 1, 2024
9b3d3a3
Fixing graph capture for flash decoding. (#2163)
Narsil Jul 2, 2024
71b0189
fix FlashDecoding change's regression in intel platform (#2161)
sywangyi Jul 2, 2024
e913f3a
fix: use the base layers weight in mistral rocm (#2155)
drbh Jul 2, 2024
bc5a792
Fixing rocm. (#2164)
Narsil Jul 2, 2024
d580215
Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167)
Narsil Jul 2, 2024
233e464
feat: improve update_docs for openapi schema (#2169)
drbh Jul 3, 2024
b6c8984
Fixing missing `object` field for regular completions.
Narsil Jul 3, 2024
878491c
Revert "Fixing missing `object` field for regular completions."
Narsil Jul 3, 2024
64989f9
Fixing the dockerfile warnings. (#2173)
Narsil Jul 3, 2024
e93c830
Fixing missing `object` field for regular completions. (#2175)
Narsil Jul 3, 2024
74ddd12
Version 2.1.1
Narsil Jul 4, 2024
2e09ebe
Preparing patch release. (#2186)
Narsil Jul 4, 2024
835ad0a
Adding "longrope" for Phi-3 (#2172) (#2179)
amihalik Jul 5, 2024
1b434e8
Refactor dead code - Removing all `flash_xxx.py` files. (#2166)
Narsil Jul 5, 2024
e481a9b
Hotfixing after refactor.
Narsil Jul 5, 2024
1e7ce69
Fix Starcoder2 after refactor (#2189)
danieldk Jul 5, 2024
54c194d
GPTQ CI improvements (#2151)
danieldk Jul 5, 2024
508e308
Consistently take `prefix` in model constructors (#2191)
danieldk Jul 5, 2024
8e3d1e6
fix dbrx & opt model prefix bug (#2201)
icyxp Jul 8, 2024
f11fd69
hotfix: Fix number of KV heads (#2202)
danieldk Jul 8, 2024
1759491
Fix incorrect cache allocation with multi-query (#2203)
danieldk Jul 8, 2024
540e710
Falcon/DBRX: get correct number of key-value heads (#2205)
danieldk Jul 8, 2024
8dd9b2b
add doc for intel gpus (#2181)
sywangyi Jul 8, 2024
4a54e41
fix: python deserialization (#2178)
jaluma Jul 8, 2024
74edda9
update to metrics 0.23.0 or could work with metrics-exporter-promethe…
sywangyi Jul 8, 2024
48f1196
feat: use model name as adapter id in chat endpoints (#2128)
drbh Jul 8, 2024
eaaea91
Fix nccl regression on PyTorch 2.3 upgrade (#2099)
fxmarty Jul 8, 2024
591f9f7
Adding sanity check to openapi docs.
Narsil Jul 9, 2024
cc4fceb
Updating the self check (#2209)
Narsil Jul 9, 2024
2a6c3ca
Move quantized weight handling out of the `Weights` class (#2194)
danieldk Jul 9, 2024
85c3c5d
Add support for FP8 on compute capability >=8.0, <8.9 (#2213)
danieldk Jul 11, 2024
5029e72
fix: append DONE message to chat stream (#2221)
drbh Jul 11, 2024
dedeb3c
Modifying base in yarn embedding (#2212)
SeongBeomLEE Jul 12, 2024
ee56266
Use symmetric quantization in the `quantize` subcommand (#2120)
danieldk Jul 12, 2024
619eede
feat: simple mistral lora integration tests (#2180)
drbh Jul 15, 2024
271ebb7
fix custom cache dir (#2226)
ErikKaum Jul 15, 2024
8a223eb
fix: Remove bitsandbytes installation when running cpu-only install (…
Hugoch Jul 15, 2024
e955f7b
Add support for AWQ-quantized Idefics2 (#2233)
danieldk Jul 16, 2024
7177da0
`server quantize`: expose groupsize option (#2225)
danieldk Jul 16, 2024
e0710cc
Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237)
danieldk Jul 16, 2024
118ee57
fix(server): fix cohere (#2249)
OlivierDehaene Jul 18, 2024
2dd680b
Improve the handling of quantized weights (#2250)
danieldk Jul 19, 2024
394f8c7
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255)
danieldk Jul 19, 2024
ba0dfb6
Hotfix: various GPT-based model fixes (#2256)
danieldk Jul 19, 2024
990ea79
Hotfix: fix MPT after recent refactor (#2257)
danieldk Jul 19, 2024
e658d95
Hotfix: pass through model revision in `VlmCausalLM` (#2258)
danieldk Jul 19, 2024
66f3de5
usage stats and crash reports (#2220)
ErikKaum Jul 19, 2024
8afc173
add usage stats to toctree (#2260)
ErikKaum Jul 19, 2024
898a892
fix: adjust default tool choice (#2244)
drbh Jul 19, 2024
c1638a5
Add support for Deepseek V2 (#2224)
danieldk Jul 19, 2024
50149c3
Add FP8 release test (#2261)
danieldk Jul 20, 2024
85f10ec
feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
OlivierDehaene Jul 20, 2024
d13215d
fix(server): fix deepseekv2 loading (#2266)
OlivierDehaene Jul 21, 2024
a5aee82
Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
icyxp Jul 22, 2024
758a8b8
legacy warning on text_generation client (#2271)
ErikKaum Jul 22, 2024
a7515b8
fix(server): fix fp8 weight loading (#2268)
OlivierDehaene Jul 22, 2024
568cc9f
Softcapping for gemma2. (#2273)
Narsil Jul 22, 2024
31eb03d
Fixing mistral nemo. (#2276)
Narsil Jul 23, 2024
919da25
fix(l4): fix fp8 logic on l4 (#2277)
OlivierDehaene Jul 23, 2024
26460f0
Add support for repacking AWQ weights for GPTQ-Marlin (#2278)
danieldk Jul 23, 2024
69b67b7
Add support for Mistral-Nemo by supporting head_dim through config (#…
shaltielshmid Jul 23, 2024
5390973
Preparing for release. (#2285)
Narsil Jul 23, 2024
43f4914
Add support for Llama 3 rotary embeddings (#2286)
danieldk Jul 23, 2024
b1077b0
hotfix: pin numpy (#2289)
danieldk Jul 23, 2024
34c472b
chore: update to torch 2.4 (#2259)
OlivierDehaene Jul 23, 2024
a994f6a
hotfix: update nccl
OlivierDehaene Jul 23, 2024
2041421
fix crash in multi-modal (#2245)
sywangyi Jul 24, 2024
d939315
fix of use of unquantized weights in cohere GQA loading, also enable …
sywangyi Jul 24, 2024
457791f
Split up `layers.marlin` into several files (#2292)
danieldk Jul 24, 2024
7ebee37
fix: refactor adapter weight loading and mapping (#2193)
drbh Jul 24, 2024
69db13e
Using g6 instead of g5. (#2281)
Narsil Jul 25, 2024
64ffd64
Some small fixes for the Torch 2.4.0 update (#2304)
danieldk Jul 25, 2024
d5e0543
Fixing idefics on g6 tests. (#2306)
Narsil Jul 25, 2024
1674f44
Fix registry name (#2307)
XciD Jul 25, 2024
fc6d80f
Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313)
danieldk Jul 26, 2024
a87791d
feat: add ruff and resolve issue (#2262)
drbh Jul 26, 2024
2c1d280
Run ci api key (#2315)
ErikKaum Jul 29, 2024
23a3927
Install Marlin from standalone package (#2320)
danieldk Jul 29, 2024
a574381
fix: reject grammars without properties (#2309)
drbh Jul 29, 2024
b1d1d26
patch-error-on-invalid-grammar (#2282)
ErikKaum Jul 29, 2024
bafab73
fix: adjust test snapshots and small refactors (#2323)
drbh Jul 29, 2024
247a29f
server quantize: store quantizer config in standard format (#2299)
danieldk Jul 30, 2024
120d577
Rebase TRT-llm (#2331)
Narsil Jul 31, 2024
468e5c6
Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300)
danieldk Jul 31, 2024
c73d1d6
Pr 2290 ci run (#2329)
drbh Jul 31, 2024
3c4f816
refactor usage stats (#2339)
ErikKaum Jul 31, 2024
d70da59
enable HuggingFaceM4/idefics-9b in intel gpu (#2338)
sywangyi Aug 1, 2024
ccddb30
Fix cache block size for flash decoding (#2351)
danieldk Aug 1, 2024
48fec7b
Unify attention output handling (#2343)
danieldk Aug 1, 2024
688321b
fix: attempt forward on flash attn2 to check hardware support (#2335)
drbh Aug 5, 2024
8b0f5fe
feat: include local lora adapter loading docs (#2359)
drbh Aug 5, 2024
83d1f23
fix: return the out tensor rather then the functions return value (#2…
drbh Aug 6, 2024
88e07f1
feat: implement a templated endpoint for visibility into chat request…
drbh Aug 6, 2024
b4562e1
feat: prefer stop over eos_token to align with openai finish_reason (…
drbh Aug 6, 2024
5400c71
feat: return the generated text when parsing fails (#2353)
drbh Aug 6, 2024
db873be
fix: default num_ln_in_parallel_attn to one if not supplied (#2364)
drbh Aug 6, 2024
3ccde43
fix: prefer original layernorm names for 180B (#2365)
drbh Aug 6, 2024
11fab8a
fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350)
almersawi Aug 7, 2024
3ea8e8a
add gptj modeling in TGI #2366 (CI RUN) (#2372)
drbh Aug 8, 2024
9b1b545
Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371)
drbh Aug 8, 2024
06b638f
Pr 2374 ci branch (#2378)
drbh Aug 8, 2024
3893d00
fix EleutherAI/gpt-neox-20b does not work in tgi (#2346)
sywangyi Aug 8, 2024
1057f28
Pr 2337 ci branch (#2379)
drbh Aug 8, 2024
853fb96
fix: prefer hidden_activation over hidden_act in gemma2 (#2381)
drbh Aug 8, 2024
b1bc0ec
Update Quantization docs and minor doc fix. (#2368)
Vaibhavs10 Aug 8, 2024
6f2a468
Pr 2352 ci branch (#2382)
drbh Aug 9, 2024
4a16da5
Add FlashInfer support (#2354)
danieldk Aug 9, 2024
dc0fa60
Add experimental flake (#2384)
danieldk Aug 9, 2024
afa14b7
Using HF_HOME instead of CACHE to get token read in addition to model…
Narsil Aug 9, 2024
e9ba044
flake: add fmt and clippy (#2389)
danieldk Aug 9, 2024
1d4a35a
Update documentation for Supported models (#2386)
Vaibhavs10 Aug 9, 2024
df719fd
flake: use rust-overlay (#2390)
danieldk Aug 9, 2024
849bd93
Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385)
Narsil Aug 9, 2024
959add5
feat: add guideline to chat request and template (#2391)
drbh Aug 9, 2024
bb83338
Update flake for 9.0a capability in Torch (#2394)
danieldk Aug 9, 2024
197dd3a
nix: add router to the devshell (#2396)
danieldk Aug 12, 2024
8750dc8
Upgrade fbgemm (#2398)
Narsil Aug 12, 2024
fbe59c6
Adding launcher to build. (#2397)
Narsil Aug 12, 2024
1daaddd
Fixing import exl2 (#2399)
Narsil Aug 12, 2024
b8efd6d
Cpu dockerimage (#2367)
sywangyi Aug 12, 2024
f586cc7
Add support for prefix caching to the v3 router (#2392)
danieldk Aug 12, 2024
6393cde
Keeping the benchmark somewhere (#2401)
Narsil Aug 12, 2024
8e6bfa2
feat: validate template variables before apply and improve sliding wi…
drbh Aug 12, 2024
3079865
fix: allocate tmp based on sgmv kernel if available (#2345)
drbh Aug 12, 2024
96e8fa3
fix: improve completions to send a final chunk with usage details (#2…
drbh Aug 12, 2024
18d6be6
Updating the flake. (#2404)
Narsil Aug 12, 2024
1f8c0f8
Pr 2395 ci run (#2406)
drbh Aug 12, 2024
10b2be6
fix: include create_exllama_buffers and set_device for exllama (#2407)
drbh Aug 12, 2024
eb561bb
nix: incremental build of the launcher (#2410)
danieldk Aug 13, 2024
c5e4c18
Adding more kernels to flake. (#2411)
Narsil Aug 13, 2024
7a4d831
add numa to improve cpu inference perf (#2330)
sywangyi Aug 13, 2024
ffc8fb0
fix: adds causal to attention params (#2408)
drbh Aug 13, 2024
bae161a
nix: partial incremental build of the router (#2416)
danieldk Aug 14, 2024
4baa6ff
Upgrading exl2. (#2415)
Narsil Aug 14, 2024
c3401e0
More fixes trtllm (#2342)
mfuntowicz Aug 14, 2024
e5c39a5
nix: build router incrementally (#2422)
danieldk Aug 15, 2024
df6ea89
Fixing exl2 and other quanize tests again. (#2419)
Narsil Aug 15, 2024
f0181ed
Upgrading the tests to match the current workings. (#2423)
Narsil Aug 15, 2024
20ed7b5
nix: try to reduce the number of Rust rebuilds (#2424)
danieldk Aug 16, 2024
df0e650
Improve the Consuming TGI + Streaming docs. (#2412)
Vaibhavs10 Aug 16, 2024
85df9fc
Further fixes. (#2426)
Narsil Aug 16, 2024
11d25a4
FIxing the CI.
Narsil Aug 16, 2024
53fdbe6
doc: Add metrics documentation and add a 'Reference' section (#2230)
Hugoch Aug 16, 2024
cd208c5
All integration tests back everywhere (too many failed CI). (#2428)
Narsil Aug 16, 2024
ddba272
nix: update to CUDA 12.4 (#2429)
danieldk Aug 19, 2024
635dde8
Prefix caching (#2402)
Narsil Aug 20, 2024
516392d
nix: add pure server to flake, add both pure and impure devshells (#2…
danieldk Aug 20, 2024
a5af557
nix: add `text-generation-benchmark` to pure devshell (#2431)
danieldk Aug 21, 2024
6654c2d
Adding eetq to flake. (#2438)
Narsil Aug 21, 2024
b7d1adc
nix: add awq-inference-engine as server dependency (#2442)
danieldk Aug 21, 2024
92ac02e
nix: add default package (#2453)
danieldk Aug 23, 2024
7aebb95
Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
drbh Aug 26, 2024
73ebbd0
Pr 2451 ci branch (#2454)
drbh Aug 27, 2024
6793b72
Fixing CI. (#2462)
Narsil Aug 27, 2024
e80b2c2
fix: bump minijinja version and add test for llama 3.1 tools (#2463)
drbh Aug 27, 2024
08834e0
fix: improve regex expression (#2468)
drbh Aug 28, 2024
622c9c3
nix: build Torch against MKL and various other improvements (#2469)
danieldk Aug 29, 2024
4e1ca8d
Lots of improvements (Still 2 allocators) (#2449)
Narsil Aug 29, 2024
990478b
feat: add /v1/models endpoint (#2433)
drbh Aug 29, 2024
61b2f49
update doc with intel cpu part (#2420)
sywangyi Aug 29, 2024
a313355
Tied embeddings in MLP speculator. (#2473)
Narsil Aug 29, 2024
07c70e7
nix: improve impure devshell (#2478)
danieldk Sep 2, 2024
3e17cb7
nix: add punica-kernels (#2477)
danieldk Sep 2, 2024
be5cb0c
fix: enable chat requests in vertex endpoint (#2481)
drbh Sep 2, 2024
34a6399
feat: support lora revisions and qkv_proj weights (#2482)
drbh Sep 2, 2024
c7b495f
hotfix: avoid non-prefilled block use when using prefix caching (#2489)
danieldk Sep 5, 2024
556a870
Adding links to Adyen blogpost. (#2492)
Narsil Sep 5, 2024
d8610a6
Add two handy gitignores for Nix environments (#2484)
danieldk Sep 5, 2024
938a7f3
hotfix: fix regression of attention api change in intel platform (#2439)
sywangyi Sep 5, 2024
1e14a94
nix: add pyright/ruff for proper LSP in the impure devshell (#2496)
danieldk Sep 6, 2024
8ba790a
Fix incompatibility with latest `syrupy` and update in Poetry (#2497)
danieldk Sep 6, 2024
67f44cc
radix trie: add assertions (#2491)
danieldk Sep 6, 2024
0198db1
hotfix: add syrupy to the right subproject (#2499)
danieldk Sep 6, 2024
7c2ed55
Add links to Adyen blogpost (#2500)
martinigoyanes Sep 6, 2024
eb54d95
Fixing more correctly the invalid drop of the batch. (#2498)
Narsil Sep 6, 2024
b67a0cd
Add Directory Check to Prevent Redundant Cloning in Build Process (#2…
vamsivallepu Sep 7, 2024
510d1c7
Prefix test - Different kind of load test to trigger prefix test bugs…
Narsil Sep 11, 2024
c6b568b
Fix tokenization yi (#2507)
Narsil Sep 11, 2024
f32fa56
Fix truffle (#2514)
Narsil Sep 11, 2024
7be7ab7
nix: support Python tokenizer conversion in the router (#2515)
danieldk Sep 12, 2024
7d89718
Add nix test. (#2513)
Narsil Sep 12, 2024
5fc0e0c
fix: pass missing revision arg for lora adapter when loading multiple…
drbh Sep 12, 2024
cbfe9e5
hotfix : enable intel ipex cpu and xpu in python3.11 (#2517)
sywangyi Sep 12, 2024
afe5cae
Use `ratatui` not (deprecated) `tui` (#2521)
strickvl Sep 13, 2024
e8c3293
Add tests for Mixtral (#2520)
danieldk Sep 16, 2024
0110b83
Adding a test for FD. (#2516)
Narsil Sep 16, 2024
0ecbd61
nix: pure Rust check/fmt/clippy/test (#2525)
danieldk Sep 17, 2024
88b72c8
fix: metrics unbounded memory (#2528)
OlivierDehaene Sep 17, 2024
29a93b7
Move to moe-kernels package and switch to common MoE layer (#2511)
danieldk Sep 17, 2024
2d470c8
Stream options. (#2533)
Narsil Sep 19, 2024
c1a99e2
Update to moe-kenels 0.3.1 (#2535)
danieldk Sep 19, 2024
b6ef2bf
doc: clarify that `--quantize` is not needed for pre-quantized models…
danieldk Sep 19, 2024
3519398
hotfix: ipex fails since cuda moe kernel is not supported (#2532)
sywangyi Sep 20, 2024
bd9675c
fix: wrap python basic logs in debug assertion in launcher (#2539)
OlivierDehaene Sep 20, 2024
514a5a7
Preparing for release. (#2540)
Narsil Sep 20, 2024
14fdc4a
Add some missing modification of 2.3.0 because of conflict
yuanwu2017 Sep 25, 2024
bab529c
Make Gaudi adapt to the tgi 2.3.0
yuanwu2017 Sep 26, 2024
67ee45a
Pass the max_batch_total_tokens to causal_lm
yuanwu2017 Oct 10, 2024
8686a0f
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Oct 23, 2024
8ebe77b
Simplify the warmup
yuanwu2017 Oct 24, 2024
b590310
Add missing import package
yuanwu2017 Oct 25, 2024
9aed9d5
nix: remove unused `_server.nix` file (#2538)
danieldk Sep 23, 2024
73e6090
chore: Add old V2 backend (#2551)
OlivierDehaene Sep 24, 2024
79ac2b7
Micro cleanup. (#2555)
Narsil Sep 24, 2024
68cfc94
Hotfixing main (#2556)
Narsil Sep 24, 2024
32d50c2
Add support for scalar FP8 weight scales (#2550)
danieldk Sep 24, 2024
d4f995e
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537)
danieldk Sep 24, 2024
8c6d3e0
Update the link to the Ratatui organization (#2546)
orhun Sep 24, 2024
5247f89
Simplify crossterm imports (#2545)
orhun Sep 24, 2024
782130d
Adding note for private models in quick-tour document (#2548)
ariG23498 Sep 24, 2024
25e0edf
Hotfixing main. (#2562)
Narsil Sep 24, 2024
97d4bdd
Cleanup Vertex + Chat (#2553)
Narsil Sep 24, 2024
a684a81
More tensor cores. (#2558)
Narsil Sep 24, 2024
0817643
remove LORA_ADAPTERS_PATH (#2563)
nbroad1881 Sep 24, 2024
6976cf8
Add LoRA adapters support for Gemma2 (#2567)
alvarobartt Sep 26, 2024
bc28f86
Fix build with `--features google` (#2566)
alvarobartt Sep 26, 2024
653193a
Improve support for GPUs with capability < 8 (#2575)
danieldk Sep 27, 2024
f82a3f5
flashinfer: pass window size and dtype (#2574)
danieldk Sep 28, 2024
55fd281
Remove compute capability lazy cell (#2580)
danieldk Sep 30, 2024
6808b2d
Update architecture.md (#2577)
ulhaqi12 Sep 30, 2024
ff905ae
Update ROCM libs and improvements (#2579)
mht-sharma Sep 30, 2024
288bcb0
Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
danieldk Sep 30, 2024
bdc4739
feat: support phi3.5 moe (#2479)
drbh Sep 30, 2024
692f8dd
Move flake back to tgi-nix `main` (#2586)
danieldk Sep 30, 2024
775e5f4
MoE Marlin: support `desc_act` for `groupsize != -1` (#2590)
danieldk Sep 30, 2024
fa964f8
nix: experimental support for building a Docker container (#2470)
danieldk Oct 1, 2024
51506aa
Mllama flash version (#2585)
Narsil Oct 2, 2024
967e671
Max token capacity metric (#2595)
Narsil Oct 2, 2024
7664d2e
CI (2592): Allow LoRA adapter revision in server launcher (#2602)
drbh Oct 2, 2024
902f526
Unroll notify error into generate response (#2597)
drbh Oct 2, 2024
34e98b1
New release 2.3.1 (#2604)
Narsil Oct 3, 2024
7e282b4
V2.3.1
Narsil Oct 3, 2024
372e071
Fix the issues of tgi-gaudi for v.2.3.1
yuanwu2017 Oct 27, 2024
c23584f
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Oct 27, 2024
4c9856f
Add missing package
yuanwu2017 Oct 28, 2024
fcf2e3a
Fix the prefill warmup issue
yuanwu2017 Nov 1, 2024
c345c73
Merge branch 'habana-main' into 2.3.0
yuanwu2017 Nov 1, 2024
636cdb4
Fix startcode issue
yuanwu2017 Nov 26, 2024
Empty file added .devcontainer/Dockerfile.trtllm
Empty file.
Empty file added .devcontainer/devcontainer.json
Empty file.
3 changes: 3 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -2,3 +2,6 @@ aml
target
server/transformers
server/flash-attention
cmake-build-debug/
cmake-build-release/
Dockerfile*
45 changes: 45 additions & 0 deletions .github/workflows/autodocs.yaml
@@ -0,0 +1,45 @@
name: Automatic Documentation for Launcher

on:
  pull_request:

jobs:
  update_docs:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Rust
        uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: stable

      - name: Install Protocol Buffers compiler
        run: |
          sudo apt-get update
          sudo apt-get install -y protobuf-compiler libprotobuf-dev

      - name: Install Launcher
        id: install-launcher
        run: cargo install --path launcher/

      - name: Install router
        id: install-router
        run: cargo install --path backends/v3/

      - uses: actions/setup-node@v4
        with:
          node-version: 22

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      - name: Check that documentation is up-to-date
        run: |
          npm install -g @redocly/cli
          python update_doc.py --check
191 changes: 191 additions & 0 deletions .github/workflows/build.yaml
@@ -0,0 +1,191 @@
name: Build and push docker image to internal registry

on:
  workflow_call:
    inputs:
      hardware:
        type: string
        description: Hardware
        # options:
        # - cuda
        # - rocm
        # - intel
        required: true
      release-tests:
        description: "Run release integration tests"
        required: true
        default: false
        type: boolean

jobs:
  build-and-push:
    outputs:
      docker_image: ${{ steps.final.outputs.docker_image }}
      docker_devices: ${{ steps.final.outputs.docker_devices }}
      runs_on: ${{ steps.final.outputs.runs_on }}
      label: ${{ steps.final.outputs.label }}
    concurrency:
      group: ${{ github.workflow }}-build-and-push-image-${{ inputs.hardware }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    runs-on:
      group: aws-highmemory-32-plus-priv
    permissions:
      contents: write
      packages: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Inject slug/short variables
        uses: rlespinasse/[email protected]
      - name: Construct hardware variables
        shell: bash
        run: |
          case ${{ inputs.hardware }} in
            cuda)
              export dockerfile="Dockerfile"
              export label_extension=""
              export docker_devices=""
              export runs_on="aws-g6-12xl-plus-priv-cache"
              export platform=""
              ;;
            rocm)
              export dockerfile="Dockerfile_amd"
              export label_extension="-rocm"
              export docker_devices="/dev/kfd,/dev/dri"
              # TODO Re-enable when they pass.
              # export runs_on="amd-gpu-tgi"
              export runs_on="ubuntu-latest"
              export platform=""
              ;;
            intel-xpu)
              export dockerfile="Dockerfile_intel"
              export label_extension="-intel-xpu"
              export docker_devices=""
              export runs_on="ubuntu-latest"
              export platform="xpu"
              ;;
            intel-cpu)
              export dockerfile="Dockerfile_intel"
              export label_extension="-intel-cpu"
              export docker_devices=""
              export runs_on="ubuntu-latest"
              export platform="cpu"
              ;;
          esac
          echo $dockerfile
          echo "Dockerfile=${dockerfile}"
          echo $label_extension
          echo $docker_devices
          echo $runs_on
          echo $platform
          echo "DOCKERFILE=${dockerfile}" >> $GITHUB_ENV
          echo "LABEL=${label_extension}" >> $GITHUB_ENV
          echo "PLATFORM=${platform}" >> $GITHUB_ENV
          echo "DOCKER_DEVICES=${docker_devices}" >> $GITHUB_ENV
          echo "RUNS_ON=${runs_on}" >> $GITHUB_ENV
          echo REGISTRY_MIRROR=$REGISTRY_MIRROR >> $GITHUB_ENV
      - name: Initialize Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          install: true
          buildkitd-config: /tmp/buildkitd.toml
      - name: Login to internal Container Registry
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
          registry: registry.internal.huggingface.tech
      - name: Login to GitHub Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Login to Azure Container Registry
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.AZURE_DOCKER_USERNAME }}
          password: ${{ secrets.AZURE_DOCKER_PASSWORD }}
          registry: db4c2190dd824d1f950f5d1555fbadf0.azurecr.io
      # If pull request
      - name: Extract metadata (tags, labels) for Docker
        if: ${{ github.event_name == 'pull_request' }}
        id: meta-pr
        uses: docker/metadata-action@v5
        with:
          images: |
            registry.internal.huggingface.tech/api-inference/community/text-generation-inference
          tags: |
            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
      # If main, release or tag
      - name: Extract metadata (tags, labels) for Docker
        if: ${{ github.event_name != 'pull_request' }}
        id: meta
        uses: docker/[email protected]
        with:
          flavor: |
            latest=auto
          images: |
            registry.internal.huggingface.tech/api-inference/community/text-generation-inference
            ghcr.io/huggingface/text-generation-inference
            db4c2190dd824d1f950f5d1555fbadf0.azurecr.io/text-generation-inference
          tags: |
            type=semver,pattern={{version}}${{ env.LABEL }}
            type=semver,pattern={{major}}.{{minor}}${{ env.LABEL }}
            type=raw,value=latest${{ env.LABEL }},enable=${{ github.ref == format('refs/heads/{0}', github.event.repository.default_branch) }}
            type=raw,value=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
      - name: Build and push Docker image
        id: build-and-push
        uses: docker/build-push-action@v4
        with:
          context: .
          file: ${{ env.DOCKERFILE }}
          push: true
          platforms: 'linux/amd64'
          build-args: |
            GIT_SHA=${{ env.GITHUB_SHA }}
            DOCKER_LABEL=sha-${{ env.GITHUB_SHA_SHORT }}${{ env.LABEL }}
            PLATFORM=${{ env.PLATFORM }}
          tags: ${{ steps.meta.outputs.tags || steps.meta-pr.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels || steps.meta-pr.outputs.labels }}
          cache-from: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL }},mode=min,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
          cache-to: type=s3,region=us-east-1,bucket=ci-docker-buildx-cache,name=text-generation-inference-cache${{ env.LABEL }},mode=min,access_key_id=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_ACCESS_KEY_ID }},secret_access_key=${{ secrets.S3_CI_DOCKER_BUILDX_CACHE_SECRET_ACCESS_KEY }},mode=min
      - name: Final
        id: final
        run: |
          echo "docker_image=registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-${{ env.GITHUB_SHA_SHORT}}${{ env.LABEL }}" >> "$GITHUB_OUTPUT"
          echo "docker_devices=${{ env.DOCKER_DEVICES }}" >> "$GITHUB_OUTPUT"
          echo "runs_on=${{ env.RUNS_ON }}" >> "$GITHUB_OUTPUT"
          echo "label=${{ env.LABEL }}" >> "$GITHUB_OUTPUT"
  integration_tests:
    concurrency:
      group: ${{ github.workflow }}-${{ github.job }}-${{ needs.build-and-push.outputs.label }}-${{ github.head_ref || github.run_id }}
      cancel-in-progress: true
    needs: build-and-push
    runs-on:
      group: ${{ needs.build-and-push.outputs.runs_on }}
    if: needs.build-and-push.outputs.runs_on != 'ubuntu-latest'
    env:
      PYTEST_FLAGS: ${{ (startsWith(github.ref, 'refs/tags/') || github.ref == 'refs/heads/main' || inputs.release-tests == true) && '--release' || '--release' }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Inject slug/short variables
        uses: rlespinasse/[email protected]
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install
        run: |
          make install-integration-tests
      - name: Run tests
        run: |
          export DOCKER_VOLUME=/mnt/cache
          export DOCKER_IMAGE=${{ needs.build-and-push.outputs.docker_image }}
          export DOCKER_DEVICES=${{ needs.build-and-push.outputs.docker_devices }}
          export HF_TOKEN=${{ secrets.HF_TOKEN }}
          echo $DOCKER_IMAGE
          pytest -s -vv integration-tests ${PYTEST_FLAGS}
20 changes: 20 additions & 0 deletions .github/workflows/build_documentation.yaml
@@ -0,0 +1,20 @@
name: Build documentation

on:
  push:
    paths:
      - "docs/source/**"
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: text-generation-inference
      additional_args: --not_python_module
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
19 changes: 19 additions & 0 deletions .github/workflows/build_pr_documentation.yaml
@@ -0,0 +1,19 @@
name: Build PR Documentation

on:
  pull_request:
    paths:
      - "docs/source/**"

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: text-generation-inference
      additional_args: --not_python_module
49 changes: 49 additions & 0 deletions .github/workflows/ci_build.yaml
@@ -0,0 +1,49 @@
name: CI build

on:
  push:
    branches:
      - 'main'
    tags:
      - 'v*'
  pull_request:
    paths:
      - ".github/workflows/build.yaml"
      - "integration-tests/**"
      - "backends/**"
      - "server/**"
      - "proto/**"
      - "router/**"
      - "launcher/**"
      - "Cargo.lock"
      - "rust-toolchain.toml"
      - "Dockerfile"
      - "Dockerfile_amd"
      - "Dockerfile_intel"
    branches:
      - "main"
  workflow_dispatch:
    inputs:
      release-tests:
        description: "Run release integration tests"
        required: true
        default: false
        type: boolean

jobs:
  build:
    strategy:
      # super important if you want to see all results, even if one fails
      # fail-fast is true by default
      fail-fast: false
      matrix:
        hardware: ["cuda", "rocm", "intel-xpu", "intel-cpu"]
    uses: ./.github/workflows/build.yaml # calls the one above ^
    permissions:
      contents: write
      packages: write
    with:
      hardware: ${{ matrix.hardware }}
      # https://github.com/actions/runner/issues/2206
      release-tests: ${{ inputs.release-tests == true }}
    secrets: inherit
26 changes: 26 additions & 0 deletions .github/workflows/client-tests.yaml
@@ -0,0 +1,26 @@
name: Python Client Tests

on:
  pull_request:
    paths:
      - ".github/workflows/client-tests.yaml"
      - "clients/python/**"

jobs:
  run_tests:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v1
        with:
          python-version: 3.9
      - name: Install
        run: |
          cd clients/python && pip install .
      - name: Run tests
        run: |
          pip install pytest pytest-asyncio
          export HF_TOKEN=${{ secrets.HF_TOKEN }}
          make python-client-tests