# Release v0.3.2
## Highlights
- Support torch.compile and CUDA graph for the Triton attention backend and DeepSeek MLA #1442 #1422 (see the launch sketch after this list)
- Initial support for multi-LoRA serving #1307
- Integrate torchao for quantization #1341
- Optimize the CPU scheduler overhead
- Multiple critical bug fixes for Llama and LLaVA (tokenizer, modalities)
- Support AMD backend #1420
- New models: MiniCPM3, OLMoE
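For a quick start, here is a minimal launch sketch combining several of the highlights above. The model path is illustrative, and the flag spellings (`--attention-backend` from #1380, `--enable-torch-compile`, `--torchao-config`) are assumptions inferred from the PRs in this release; verify them against `python -m sglang.launch_server --help` on your install.

```python
# Minimal sketch (not an official recipe): launch an SGLang server with
# the Triton attention backend plus torch.compile / CUDA graph (#1442, #1422).
# Flag names are assumptions inferred from the PRs in this release.
import subprocess

subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        "--attention-backend", "triton",  # replaces --disable-flashinfer (#1380)
        "--enable-torch-compile",         # torch.compile support (#1422)
        # "--torchao-config", "int4wo-128",  # torchao quantization (#1341); flag name assumed
        "--port", "30000",
    ],
    check=True,
)
```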
## What's Changed
- Remove useless fields in global_config.py by @merrymercy in #1328
- docs: update README by @zhyncs in #1336
- docs: highlight TTFT, ITL, and throughput by @zhyncs in #1337
- docs: add conclusion by @zhyncs in #1340
- Optimize schedule by @hnyls2002 in #1339
- Fix some online scheduling delay by @hnyls2002 in #1345
- [triton] Support head_dim not 2^n in triton extend and decode attention by @ByronHsu in #1281
- [Feat] Add modalities for vision server when handling pixel values for llava by @kcz358 in #1346
- [server] Passing `model_override_args` to `launch_server` via the CLI by @kevin85421 in #1298
- [Minor] Many cleanups by @merrymercy in #1357
- Add torchao quant (int4/int8/fp8) to llama models by @jerryzh168 in #1341
- [CI] Return output logprobs in unit test by @Ying1123 in #1361
- Unify forward mode by @hnyls2002 in #1360
- Support OpenAI API json_schema response format by @zifeitong in #1363 (see the request sketch after this list)
- Adding Documentation for installation by @zhaochenyang20 in #1300
- [Docs] Improve documentations by @merrymercy in #1368
- Fix bug of undefined `is_single` in method `create_abort_task` by @wcsjtu in #1370
- Support MiniCPM3 by @Achazwl in #1371
- Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy by @josephrocca in #1373
- [Minor] improve kill scripts and torchao import by @merrymercy in #1375
- Fix vocab mask update bug by @hnyls2002 in #1376
- [Minor] move triton attention kernels into a separate folder by @merrymercy in #1379
- Deprecate --disable-flashinfer and introduce --attention-backend by @merrymercy in #1380
- Organize flashinfer indices update by @hnyls2002 in #1378
- remove assertion in triton attention and add a unit test by @ByronHsu in #1385
- BaiChuan2 Model by @blacker521 in #1367
- [Fix] Fix --disable-flashinfer by @merrymercy in #1389
- Improve error reporting during server launch by @merrymercy in #1390
- Refactor attention backend by @merrymercy in #1381
- Add no-commit-to-main rule by @hnyls2002 in #1393
- Fix README format by @Achazwl in #1399
- Support cuda graph in the triton attention backend by @merrymercy in #1401
- kernel: use tensor cores for flashinfer gqa kernels by @yzh119 in #1403
- [Minor Fix] Fix llava modalities issue for single-image by @kcz358 in #1402
- Add Support for XVERSE Models (Dense and MoE) to sglang by @hxer7963 in #1397
- [Feature] Initial support for multi-LoRA serving by @Ying1123 in #1307 (see the request sketch after this list)
- [Minor, CI] remove lora test from minimal suite by @Ying1123 in #1406
- Make stop reason a dict instead of str by @merrymercy in #1407
- [CI] Include triton backend and online serving benchmark into CI by @merrymercy in #1408
- [Minor] Raise exception for wrong import by @Ying1123 in #1409
- Balance test in CI by @merrymercy in #1411
- Update pr-test.yml by @merrymercy in #1412
- ci: fix finish by @zhyncs in #1414
- Optimize conflicts between CUDA graph and vocab mask tensors by @hnyls2002 in #1392
- Add torchao quant for mixtral and qwen_moe by @jerryzh168 in #1418
- Add pytorch sampling backend ut by @ispobock in #1425
- fix: resolve nightly eval by @zhyncs in #1426
- Enable torch.compile for triton backend by @merrymercy in #1422
- Add libibverbs-dev to Dockerfile by @Aphoh in #1427
- Update backend.md by @merrymercy in #1429
- [Fix] Fix logprob and normalized_logprob by @merrymercy in #1428
- Release v0.3.1 by @merrymercy in #1430
- Remove deprecated configs by @merrymercy in #1431
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks by @Ying1123 in #1433
- Revert "[Minor] Raise exception for wrong import (#1409)" by @Ying1123 in #1432
- Add constrained_json_whitespace_pattern to ServerArgs by @zifeitong in #1438
- Clean up model loader by @merrymercy in #1440
- Simplify sampler and its error handling by @merrymercy in #1441
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by @HaiShaw in #1420
- Fix torch compile for deepseek-v2 by @ispobock in #1442
- Add OLMoE model by @janimo in #1444
- Release 0.3.1.post1 by @merrymercy in #1445
- Enable MLA by default by @ispobock in #1447
- Fix attention backend by @ispobock in #1448
- fix schedule bug by @hnyls2002 in #1450
- Fix schedule bug by @hnyls2002 in #1451
- Fixed n>1 causing list index out of range with VLM by @jasonyux in #1449
- Add bench_server_latency.py by @merrymercy in #1452
- [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) by @HaiShaw in #1453
- Fix oom issues with fp8 for llama by @merrymercy in #1454
- Fuse top_k and top_p in the sampler by @merrymercy in #1457
- [Event] Add public meeting invite to README by @Ying1123 in #1458
- fix: create a new dict every time for putting new frame by @Luodian in #1464
- Fix padding in the cuda graph by @merrymercy in #1469
- Release v0.3.1.post2 by @merrymercy in #1470
- Fix env vars in bench_latency by @merrymercy in #1472
- feat: update linear deps 1/N by @zhyncs in #1305
- minor: add quant eval compared with base by @zhyncs in #1475
- Add OLMoE by @Muennighoff in #1476
- Fix triton head num by @ispobock in #1482
- Add MLA gsm8k eval by @ispobock in #1484
- chore: bump v0.3.1.post3 by @zhyncs in #1483
- fix incorrect links in documentation by @rchen19 in #1481
- doc: update backend by @zhyncs in #1486
- Better unit tests for adding a new model by @merrymercy in #1488
- Fix max workers by @wellhowtosay in #1456
- Add a unit test for data parallelism by @merrymercy in #1489
- Add AMD tests to CI by @Ying1123 in #1491
- Update dockerfile to include datamodel_code_generator by @merrymercy in #1492
- [API, Feature] Support response prefill for openai API by @Ying1123 in #1490
- minor: add mla fp8 test by @zhyncs in #1494
- Fix the overhead due to penalizer in bench_latency by @merrymercy in #1496
- MoE torch compile by @ispobock in #1497
- [CI] Move AMD test to a separate file by @merrymercy in #1500
- Update test_srt_backend.py by @merrymercy in #1502
- Debug RadixCache stack overflow by @luzengxiangcn in #1499
- Simplify bench_latency.py by @merrymercy in #1503
- [Fix] Fix clean_up_tokenization_spaces in tokenizer by @merrymercy in #1510
- Add support for tie_word_embeddings when loading weights + support for SmolLM by @TianyiQ in #1508
- Revert "kernel: use tensor cores for flashinfer gqa kernels" by @Ying1123 in #1511
- Release v0.3.2 by @Ying1123 in #1512
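Two hedged usage sketches for features in the list above. First, the OpenAI-compatible `json_schema` response format from #1363 (whitespace handling is tunable server-side via #1438's `constrained_json_whitespace_pattern`). The endpoint, port, model name, and schema below are illustrative assumptions, not taken from the PRs.

```python
# Sketch: request a structured response via the OpenAI-compatible
# json_schema response format (#1363). Endpoint, port, model name, and
# schema are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # SGLang serves one model per server; name is an assumption
    messages=[{"role": "user", "content": "Give me the capital of France."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {"capital": {"type": "string"}},
                "required": ["capital"],
            },
        },
    },
)
print(resp.choices[0].message.content)  # e.g. {"capital": "Paris"}
```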
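Second, the initial multi-LoRA serving from #1307 and #1433. Both the `--lora-paths name=path` launch syntax and the per-request `lora_path` field are assumptions inferred from the PR titles; see the LoRA serving benchmarks added in #1433 for the actual interface.

```python
# Sketch: select a named LoRA adapter per request (#1307, #1433).
# Assumes the server was launched with something like:
#   python -m sglang.launch_server --model-path <base> --lora-paths sql_adapter=/path/to/adapter
# The flag syntax and the `lora_path` request field are assumptions.
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Translate to SQL: list all users",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
        "lora_path": "sql_adapter",  # hypothetical adapter name
    },
)
print(resp.json())
```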
## New Contributors
- @zifeitong made their first contribution in #1363
- @wcsjtu made their first contribution in #1370
- @Achazwl made their first contribution in #1371
- @josephrocca made their first contribution in #1373
- @blacker521 made their first contribution in #1367
- @yzh119 made their first contribution in #1403
- @hxer7963 made their first contribution in #1397
- @Aphoh made their first contribution in #1427
- @HaiShaw made their first contribution in #1420
- @jasonyux made their first contribution in #1449
- @Muennighoff made their first contribution in #1476
- @rchen19 made their first contribution in #1481
- @wellhowtosay made their first contribution in #1456
- @luzengxiangcn made their first contribution in #1499
- @TianyiQ made their first contribution in #1508
**Full Changelog**: v0.3.0...v0.3.2