# Release v0.3.2
## Highlights
- Support torch.compile and CUDA graph for the Triton attention backend and DeepSeek MLA #1442 #1422 (see the launch sketch after this list)
- Initial support for multi-LoRA serving #1307
- Integrate torchao for quantization #1341
- Optimize the CPU scheduler overhead
- Multiple critical bug fixes for Llama and LLaVA (tokenizer, modalities)
- Support AMD backend #1420
- New models: MiniCPM3, OLMoE
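For a quick start, here is a minimal launch sketch combining several of the highlights above. The model path is illustrative, and the flag spellings (`--attention-backend` from #1380, `--enable-torch-compile`, `--torchao-config`) are assumptions inferred from the PRs in this release; verify them against `python -m sglang.launch_server --help` on your install.

```python
# Minimal sketch (not an official recipe): launch an SGLang server with
# the Triton attention backend plus torch.compile / CUDA graph (#1442, #1422).
# Flag names are assumptions inferred from the PRs in this release.
import subprocess

subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "--attention-backend", "triton",  # replaces --disable-flashinfer (#1380)
        "--enable-torch-compile",         # torch.compile support (#1422)
        # "--torchao-config", "int4wo-128",  # torchao quantization (#1341); flag name assumed
        "--port", "30000",
    ],
    check=True,
)
```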
## What's Changed
- Remove useless fields in global_config.py by @merrymercy in #1328
- docs: update README by @zhyncs in #1336
- docs: highlight TTFT, ITL, and throughput by @zhyncs in #1337
- docs: add conclusion by @zhyncs in #1340
- Optimize schedule by @hnyls2002 in #1339
- Fix some online scheduling delay by @hnyls2002 in #1345
- [triton] Support head_dim not 2^n in triton extend and decode attention by @ByronHsu in #1281
- [Feat] Add modalities for vision server when handling pixel values for llava by @kcz358 in #1346
- [server] Passing `model_override_args` to `launch_server` via the CLI by @kevin85421 in #1298
- [Minor] Many cleanups by @merrymercy in #1357
- Add torchao quant (int4/int8/fp8) to llama models by @jerryzh168 in #1341
- [CI] Return output logprobs in unit test by @Ying1123 in #1361
- Unify forward mode by @hnyls2002 in #1360
- Support OpenAI API json_schema response format by @zifeitong in #1363 (see the request sketch after this list)
- Adding Documentation for installation by @zhaochenyang20 in #1300
- [Docs] Improve documentations by @merrymercy in #1368
- Fix bug of undefined `is_single` in method `create_abort_task` by @wcsjtu in #1370
- Support MiniCPM3 by @Achazwl in #1371
- Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy by @josephrocca in #1373
- [Minor] improve kill scripts and torchao import by @merrymercy in #1375
- Fix vocab mask update bug by @hnyls2002 in #1376
- [Minor] move triton attention kernels into a separate folder by @merrymercy in #1379
- Deprecate --disable-flashinfer and introduce --attention-backend by @merrymercy in #1380
- Organize flashinfer indices update by @hnyls2002 in #1378
- remove assertion in triton attention and add a unit test by @ByronHsu in #1385
- BaiChuan2 Model by @blacker521 in #1367
- [Fix] Fix --disable-flashinfer by @merrymercy in #1389
- Improve error reporting during server launch by @merrymercy in #1390
- Refactor attention backend by @merrymercy in #1381
- Add no-commit-to-main rule by @hnyls2002 in #1393
- Fix README format by @Achazwl in #1399
- Support cuda graph in the triton attention backend by @merrymercy in #1401
- kernel: use tensor cores for flashinfer gqa kernels by @yzh119 in #1403
- [Minor Fix] Fix llava modalities issue for single-image by @kcz358 in #1402
- Add Support for XVERSE Models (Dense and MoE) to sglang by @hxer7963 in #1397
- [Feature] Initial support for multi-LoRA serving by @Ying1123 in #1307 (see the request sketch after this list)
- [Minor, CI] remove lora test from minimal suite by @Ying1123 in #1406
- Make stop reason a dict instead of str by @merrymercy in #1407
- [CI] Include triton backend and online serving benchmark into CI by @merrymercy in #1408
- [Minor] Raise exception for wrong import by @Ying1123 in #1409
- Balance test in CI by @merrymercy in #1411
- Update pr-test.yml by @merrymercy in #1412
- ci: fix finish by @zhyncs in #1414
- Optimize conflicts between CUDA graph and vocab mask tensors by @hnyls2002 in #1392
- Add torchao quant for mixtral and qwen_moe by @jerryzh168 in #1418
- Add pytorch sampling backend ut by @ispobock in #1425
- fix: resolve nightly eval by @zhyncs in #1426
- Enable torch.compile for triton backend by @merrymercy in #1422
- Add libibverbs-dev to Dockerfile by @Aphoh in #1427
- Update backend.md by @merrymercy in #1429
- [Fix] Fix logprob and normalized_logprob by @merrymercy in #1428
- Release v0.3.1 by @merrymercy in #1430
- Remove deprecated configs by @merrymercy in #1431
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks by @Ying1123 in #1433
- Revert "[Minor] Raise exception for wrong import (#1409)" by @Ying1123 in #1432
- Add constrained_json_whitespace_pattern to ServerArgs by @zifeitong in #1438
- Clean up model loader by @merrymercy in #1440
- Simplify sampler and its error handling by @merrymercy in #1441
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by @HaiShaw in #1420
- Fix torch compile for deepseek-v2 by @ispobock in #1442
- Add OLMoE model by @janimo in #1444
- Release 0.3.1.post1 by @merrymercy in #1445
- Enable MLA by default by @ispobock in #1447
- Fix attention backend by @ispobock in #1448
- fix schedule bug by @hnyls2002 in #1450
- Fix schedule bug by @hnyls2002 in #1451
- Fixed n>1 causing list index out of range with VLM by @jasonyux in #1449
- Add bench_server_latency.py by @merrymercy in #1452
- [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) by @HaiShaw in #1453
- Fix oom issues with fp8 for llama by @merrymercy in #1454
- Fuse top_k and top_p in the sampler by @merrymercy in #1457
- [Event] Add public meeting invite to README by @Ying1123 in #1458
- fix: create a new dict every time for putting new frame by @Luodian in #1464
- Fix padding in the cuda graph by @merrymercy in #1469
- Release v0.3.1.post2 by @merrymercy in #1470
- Fix env vars in bench_latency by @merrymercy in #1472
- feat: update linear deps 1/N by @zhyncs in #1305
- minor: add quant eval compared with base by @zhyncs in #1475
- Add OLMoE by @Muennighoff in #1476
- Fix triton head num by @ispobock in #1482
- Add MLA gsm8k eval by @ispobock in #1484
- chore: bump v0.3.1.post3 by @zhyncs in #1483
- fix incorrect links in documentation by @rchen19 in #1481
- doc: update backend by @zhyncs in #1486
- Better unit tests for adding a new model by @merrymercy in #1488
- Fix max workers by @wellhowtosay in #1456
- Add a unit test for data parallelism by @merrymercy in #1489
- Add AMD tests to CI by @Ying1123 in #1491
- Update dockerfile to include datamodel_code_generator by @merrymercy in #1492
- [API, Feature] Support response prefill for openai API by @Ying1123 in #1490
- minor: add mla fp8 test by @zhyncs in #1494
- Fix the overhead due to penalizer in bench_latency by @merrymercy in #1496
- MoE torch compile by @ispobock in #1497
- [CI] Move AMD test to a separate file by @merrymercy in #1500
- Update test_srt_backend.py by @merrymercy in #1502
- Debug RadixCache stack overflow by @luzengxiangcn in #1499
- Simplify bench_latency.py by @merrymercy in #1503
- [Fix] Fix clean_up_tokenization_spaces in tokenizer by @merrymercy in #1510
- Add support for tie_word_embeddings when loading weights + support for SmolLM by @TianyiQ in #1508
- Revert "kernel: use tensor cores for flashinfer gqa kernels" by @Ying1123 in #1511
- Release v0.3.2 by @Ying1123 in #1512
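Two hedged usage sketches for features in the list above. First, the OpenAI-compatible `json_schema` response format from #1363 (whitespace handling is tunable server-side via #1438's `constrained_json_whitespace_pattern`). The endpoint, port, model name, and schema below are illustrative assumptions, not taken from the PRs.

```python
# Sketch: request a structured response via the OpenAI-compatible
# json_schema response format (#1363). Endpoint, port, model name, and
# schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # SGLang serves one model per server; name is an assumption
    messages=[{"role": "user", "content": "Give me the capital of France."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
)
print(resp.choices[0].message.content)  # e.g. {"capital": "Paris"}
```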
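Second, the initial multi-LoRA serving from #1307 and #1433. Both the `--lora-paths name=path` launch syntax and the per-request `lora_path` field are assumptions inferred from the PR titles; see the LoRA serving benchmarks added in #1433 for the actual interface.

```python
# Sketch: select a named LoRA adapter per request (#1307, #1433).
# Assumes the server was launched with something like:
#   python -m sglang.launch_server --model-path <base> --lora-paths sql_adapter=/path/to/adapter
# The flag syntax and the `lora_path` request field are assumptions.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Translate to SQL: list all users",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
        "lora_path": "sql_adapter",  # hypothetical adapter name
    },
)
print(resp.json())
```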
## New Contributors
- @zifeitong made their first contribution in #1363
- @wcsjtu made their first contribution in #1370
- @Achazwl made their first contribution in #1371
- @josephrocca made their first contribution in #1373
- @blacker521 made their first contribution in #1367
- @yzh119 made their first contribution in #1403
- @hxer7963 made their first contribution in #1397
- @Aphoh made their first contribution in #1427
- @HaiShaw made their first contribution in #1420
- @jasonyux made their first contribution in #1449
- @Muennighoff made their first contribution in #1476
- @rchen19 made their first contribution in #1481
- @wellhowtosay made their first contribution in #1456
- @luzengxiangcn made their first contribution in #1499
- @TianyiQ made their first contribution in #1508
**Full Changelog**: v0.3.0...v0.3.2