sglang Dense LLM: PD-Disaggregated Deployment
Install sglang
pip install "sglang[all]>=0.4.6.post5"
Standard deployment of a dense model
python3 -m sglang.launch_server --model-path Qwen/Qwen2-1.5B-Instruct
Quick test
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2-1.5B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
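The same request can be sent from Python. This is a minimal sketch using only the standard library. The helper names `build_chat_request` and `ask` are hypothetical, and the response parsing assumes the OpenAI-style layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `ask("http://localhost:30000", "Qwen/Qwen2-1.5B-Instruct", "What is the capital of France?")` should return the model's answer as a string.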
Benchmark
python3 -m sglang.bench_one_batch_server \
--model-path Qwen/Qwen2-1.5B-Instruct \
--base-url http://localhost:30000 \
--batch-size 512 \
--input-len 1024 \
--output-len 5 \
--skip-warmup
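To put the benchmark numbers in context, it helps to see how lopsided this workload is. The arithmetic below uses only the flag values from the command above and shows that nearly all tokens in the batch are prompt (prefill) tokens.

```python
batch_size = 512
input_len = 1024
output_len = 5

prefill_tokens = batch_size * input_len   # prompt tokens processed in the batch
decode_tokens = batch_size * output_len   # tokens generated in the batch
prefill_share = prefill_tokens / (prefill_tokens + decode_tokens)

print(prefill_tokens, decode_tokens, f"{prefill_share:.2%}")
# prints: 524288 2560 99.51%
```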
PD deployment of a dense model
The ServerArgs options related to PD disaggregation:
# Disaggregation
parser.add_argument(
    "--disaggregation-mode",
    type=str,
    default="null",
    choices=["null", "prefill", "decode"],
    help='Only used for PD disaggregation. "prefill" for prefill-only server, and "decode" for decode-only server. If not specified, it is not PD disaggregated',
)
parser.add_argument(
    "--disaggregation-bootstrap-port",
    type=int,
    default=ServerArgs.disaggregation_bootstrap_port,
    help="Bootstrap server port on the prefill server. Default is 8998.",
)
parser.add_argument(
    "--disaggregation-transfer-backend",
    type=str,
    default=ServerArgs.disaggregation_transfer_backend,
    choices=["mooncake", "nixl"],
    help="The backend for disaggregation transfer. Default is mooncake.",
)
parser.add_argument(
    "--disaggregation-ib-device",
    type=str,
    default=ServerArgs.disaggregation_ib_device,
    help="The ib device for disaggregation transfer. Default is None, it will be detected automatically if using the mooncake backend.",
)
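To see how these flags behave when parsed, here is a standalone sketch that re-declares them on a plain argparse parser. It is not sglang's actual ServerArgs class. The defaults (8998, mooncake, None) are taken from the help strings above, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Standalone re-declaration of the four PD flags (sketch, not sglang code)."""
    parser = argparse.ArgumentParser(description="PD disaggregation flags (sketch)")
    parser.add_argument("--disaggregation-mode", type=str, default="null",
                        choices=["null", "prefill", "decode"])
    # Defaults below are assumptions copied from the help text.
    parser.add_argument("--disaggregation-bootstrap-port", type=int, default=8998)
    parser.add_argument("--disaggregation-transfer-backend", type=str,
                        default="mooncake", choices=["mooncake", "nixl"])
    parser.add_argument("--disaggregation-ib-device", type=str, default=None)
    return parser


args = build_parser().parse_args(
    ["--disaggregation-mode", "prefill", "--disaggregation-bootstrap-port", "8990"]
)
print(args.disaggregation_mode, args.disaggregation_bootstrap_port,
      args.disaggregation_transfer_backend, args.disaggregation_ib_device)
# prints: prefill 8990 mooncake None
```

Flags that are not passed fall back to their defaults, so a server started without `--disaggregation-mode` stays in the non-disaggregated "null" mode.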
Install the mooncake transfer engine:
pip install mooncake-transfer-engine
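The mooncake transfer engine needs an IB device. Run the ibv_devices command to list the NIC devices and pass the ones you want via --disaggregation-ib-device; keep in mind that not every NIC on the machine is necessarily RDMA-capable.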
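The device list printed by ibv_devices can be turned into a value for --disaggregation-ib-device with a few lines of Python. This is a sketch: `parse_ibv_devices` and `detect_ib_devices` are hypothetical helpers, the sample text is illustrative rather than real output from any machine, and the parser assumes the usual two-column layout (device name, node GUID) under a header and a dashed separator. It does not check whether a device is RDMA-capable.

```python
import subprocess


def parse_ibv_devices(text):
    """Return device names from ibv_devices-style output."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the "device  node GUID" header, and the dashed separator.
        if not fields or fields[0] == "device" or set(fields[0]) == {"-"}:
            continue
        names.append(fields[0])
    return names


def detect_ib_devices():
    """Run ibv_devices (must be installed) and parse its output."""
    out = subprocess.run(["ibv_devices"], capture_output=True, text=True, check=True)
    return parse_ibv_devices(out.stdout)


# Illustrative sample, not real output.
SAMPLE = """\
    device                 node GUID
    ------              ----------------
    mlx5_0              0000000000000001
    mlx5_1              0000000000000002
"""
print(",".join(parse_ibv_devices(SAMPLE)))
# prints: mlx5_0,mlx5_1
```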
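For nixl, prefer a prebuilt container image, since a plain pip install may run into problems. Unlike mooncake, nixl does not need an ib-device setting.
The following demonstrates a PD-disaggregated deployment across different GPUs within a single node:
Start Prefill 0: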
export CUDA_VISIBLE_DEVICES=0
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--mem-fraction-static 0.85 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8990 \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device "mlx5_0,mlx5_1" \
--page-size 16 \
--disable-radix-cache \
--host 0.0.0.0 \
--port 30000
Start Prefill 1:
export CUDA_VISIBLE_DEVICES=1
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--mem-fraction-static 0.85 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 8991 \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device "mlx5_0,mlx5_1" \
--page-size 16 \
--disable-radix-cache \
--host 0.0.0.0 \
--port 30001
Start Decode 0:
export SGLANG_NUM_RESERVED_DECODE_TOKENS=512
export CUDA_VISIBLE_DEVICES=7
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2-1.5B-Instruct \
--trust-remote-code \
--cuda-graph-bs 1 2 4 8 16 24 32 40 64 80 96 128 144 160 196 256 \
--mem-fraction-static 0.85 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device "mlx5_0,mlx5_1" \
--page-size 16 \
--disable-radix-cache \
--host 0.0.0.0 \
--port 30007
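The three launch commands differ only in GPU index, mode, and ports, so they can be generated from a small table. The sketch below only builds and prints the argument lists; it does not start anything. `Instance` and `launch_argv` are hypothetical names, the flag values are copied from the commands above, and the decode-only extras (`--cuda-graph-bs`, `SGLANG_NUM_RESERVED_DECODE_TOKENS`) are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional

MODEL = "Qwen/Qwen2-1.5B-Instruct"


@dataclass
class Instance:
    gpu: int
    mode: str                              # "prefill" or "decode"
    port: int
    bootstrap_port: Optional[int] = None   # only used by prefill instances


def launch_argv(inst: Instance) -> List[str]:
    """Build the sglang.launch_server argument list for one instance."""
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", MODEL,
        "--trust-remote-code",
        "--mem-fraction-static", "0.85",
        "--disaggregation-mode", inst.mode,
        "--disaggregation-transfer-backend", "mooncake",
        "--disaggregation-ib-device", "mlx5_0,mlx5_1",
        "--page-size", "16",
        "--disable-radix-cache",
        "--host", "0.0.0.0",
        "--port", str(inst.port),
    ]
    if inst.mode == "prefill" and inst.bootstrap_port is not None:
        argv += ["--disaggregation-bootstrap-port", str(inst.bootstrap_port)]
    return argv


INSTANCES = [
    Instance(gpu=0, mode="prefill", port=30000, bootstrap_port=8990),
    Instance(gpu=1, mode="prefill", port=30001, bootstrap_port=8991),
    Instance(gpu=7, mode="decode", port=30007),
]

for inst in INSTANCES:
    # CUDA_VISIBLE_DEVICES pins each server to one GPU, as in the exports above.
    print(f"CUDA_VISIBLE_DEVICES={inst.gpu}", " ".join(launch_argv(inst)))
```

Keeping the instances in one table makes it easy to confirm that every server on the node has its own port and every prefill server its own bootstrap port.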
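Setting the environment variable MC_TE_METRIC=true on the prefill server makes the mooncake transfer engine print its transfer performance.
Start the load balancer
Starting with 0.4.6.post5, multiple values are separated by spaces instead of commas, for example: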
python3 -m sglang.srt.disaggregation.mini_lb \
--port 8000 \
--prefill http://localhost:30000 http://localhost:30001 \
--prefill-bootstrap-ports 8990 8991 \
--decode http://localhost:30007
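Note that P and D instances on the same node must use different port numbers; across different nodes this does not matter.
xPyD is supported: list the URLs of the individual instances and the prefill-bootstrap-ports separated by spaces (commas before 0.4.6.post5).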
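For an xPyD layout the load-balancer arguments can be assembled programmatically. This sketch (the helper `mini_lb_argv` is hypothetical) emits the space-separated form used in the mini_lb command above and checks two things: there is one bootstrap port per prefill URL, and no instance URL is repeated.

```python
from typing import List


def mini_lb_argv(port: int, prefill: List[str], bootstrap_ports: List[int],
                 decode: List[str]) -> List[str]:
    """Build the mini_lb argument list for x prefill and y decode instances."""
    if len(prefill) != len(bootstrap_ports):
        raise ValueError("need exactly one bootstrap port per prefill URL")
    urls = prefill + decode
    if len(set(urls)) != len(urls):
        raise ValueError("each instance needs its own host:port")
    return (
        ["python3", "-m", "sglang.srt.disaggregation.mini_lb", "--port", str(port)]
        + ["--prefill", *prefill]
        + ["--prefill-bootstrap-ports", *map(str, bootstrap_ports)]
        + ["--decode", *decode]
    )


argv = mini_lb_argv(
    8000,
    prefill=["http://localhost:30000", "http://localhost:30001"],
    bootstrap_ports=[8990, 8991],
    decode=["http://localhost:30007"],
)
print(" ".join(argv))
```

Passing the list to `subprocess.run(argv)` would start the balancer; adding more decode URLs to `decode` extends the layout to 2P2D and beyond.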
Test whether it works:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2-1.5B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
Here curl talks to the port exposed by mini_lb.