Installing sglang

pip install "sglang[all]>=0.4.6.post5"

Standard deployment of a dense model

python3 -m sglang.launch_server --model-path Qwen/Qwen2-1.5B-Instruct

Quick test

curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2-1.5B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
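
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `send_chat` are hypothetical and not part of sglang, and the sketch assumes the server started above is reachable at the given base URL.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def send_chat(base_url: str, payload: dict, timeout: float = 60.0) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same body as the curl command; no network traffic happens until send_chat is called.
payload = build_chat_payload("Qwen/Qwen2-1.5B-Instruct", "What is the capital of France?")
print(json.dumps(payload))
```

Calling `send_chat("http://localhost:30000", payload)` should return the same JSON document that curl prints, provided the server is running.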

Benchmark

python3 -m sglang.bench_one_batch_server \
  --model-path Qwen/Qwen2-1.5B-Instruct \
  --base-url http://localhost:30000 \
  --batch-size 512 \
  --input-len 1024 \
  --output-len 5 \
  --skip-warmup
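
To get a feel for what these settings ask of the server, the token counts can be worked out by hand. This is back-of-the-envelope arithmetic, not something `bench_one_batch_server` prints, and the function name `batch_token_counts` is made up for illustration.

```python
def batch_token_counts(batch_size: int, input_len: int, output_len: int) -> dict:
    """Return the number of prefill and decode tokens in one benchmark batch."""
    prefill_tokens = batch_size * input_len
    decode_tokens = batch_size * output_len
    return {
        "prefill_tokens": prefill_tokens,
        "decode_tokens": decode_tokens,
        "total_tokens": prefill_tokens + decode_tokens,
    }


# Values taken from the benchmark command above.
counts = batch_token_counts(batch_size=512, input_len=1024, output_len=5)
print(counts)
# {'prefill_tokens': 524288, 'decode_tokens': 2560, 'total_tokens': 526848}
```

With 524288 input tokens against 2560 output tokens, this configuration mostly exercises the prefill phase.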

PD-disaggregated deployment of a dense model

The ServerArgs options related to PD (prefill/decode) disaggregation:

# Disaggregation
parser.add_argument(
    "--disaggregation-mode",
    type=str,
    default="null",
    choices=["null", "prefill", "decode"],
    help='Only used for PD disaggregation. "prefill" for prefill-only server, and "decode" for decode-only server. If not specified, it is not PD disaggregated',
)
parser.add_argument(
    "--disaggregation-bootstrap-port",
    type=int,
    default=ServerArgs.disaggregation_bootstrap_port,
    help="Bootstrap server port on the prefill server. Default is 8998.",
)
parser.add_argument(
    "--disaggregation-transfer-backend",
    type=str,
    default=ServerArgs.disaggregation_transfer_backend,
    choices=["mooncake", "nixl"],
    help="The backend for disaggregation transfer. Default is mooncake.",
)
parser.add_argument(
    "--disaggregation-ib-device",
    type=str,
    default=ServerArgs.disaggregation_ib_device,
    help="The ib device for disaggregation transfer. Default is None, it will be detected automatically if using the mooncake backend.",
)
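
The excerpt above depends on the `ServerArgs` class, so it does not run on its own. As a rough illustration of how the four flags behave, here is a standalone sketch that hard-codes the defaults named in the help strings (8998, mooncake, None) instead of reading them from `ServerArgs`; it is a stand-in for experimentation, not sglang's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone copy of the disaggregation flags with literal defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disaggregation-mode",
        type=str,
        default="null",
        choices=["null", "prefill", "decode"],
    )
    # Default taken from the help text above, not from ServerArgs.
    parser.add_argument("--disaggregation-bootstrap-port", type=int, default=8998)
    parser.add_argument(
        "--disaggregation-transfer-backend",
        type=str,
        default="mooncake",
        choices=["mooncake", "nixl"],
    )
    parser.add_argument("--disaggregation-ib-device", type=str, default=None)
    return parser


args = build_parser().parse_args(
    ["--disaggregation-mode", "prefill", "--disaggregation-bootstrap-port", "8990"]
)
print(
    args.disaggregation_mode,
    args.disaggregation_bootstrap_port,
    args.disaggregation_transfer_backend,
    args.disaggregation_ib_device,
)
# prefill 8990 mooncake None
```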

Install the mooncake transfer engine:

pip install mooncake-transfer-engine

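The mooncake transfer engine needs an ib-device to be set. Run the ibv_devices command to list the network devices and pass the ones you want to disaggregation-ib-device; note that not every NIC on the machine is necessarily RDMA-capable.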
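
Turning the `ibv_devices` listing into a value for `--disaggregation-ib-device` can be scripted. The sketch below assumes the usual two-column layout (device name, then a 16-digit hexadecimal node GUID); the sample text and GUIDs are invented, and the function names are hypothetical. It only extracts names and does not check whether a device is actually usable for RDMA.

```python
import re
import subprocess
from typing import List

# A node GUID is assumed to be printed as 16 hexadecimal digits.
GUID_PATTERN = re.compile(r"^[0-9a-fA-F]{16}$")


def parse_ibv_devices(output: str) -> List[str]:
    """Extract device names from ibv_devices-style text (assumed layout)."""
    devices = []
    for line in output.splitlines():
        fields = line.split()
        # Header and separator rows do not have a GUID in the second column.
        if len(fields) == 2 and GUID_PATTERN.match(fields[1]):
            devices.append(fields[0])
    return devices


def list_ib_devices() -> List[str]:
    """Run ibv_devices on this machine and parse its output."""
    result = subprocess.run(
        ["ibv_devices"], capture_output=True, text=True, check=True
    )
    return parse_ibv_devices(result.stdout)


# Invented sample for illustration only.
SAMPLE_OUTPUT = """\
    device                 node GUID
    ------              ----------------
    mlx5_0              98039b03009a4d00
    mlx5_1              98039b03009a4d01
"""

ib_device_value = ",".join(parse_ibv_devices(SAMPLE_OUTPUT))
print(ib_device_value)
```

The comma-joined string has the same shape as the `"mlx5_0,mlx5_1"` value used in the launch commands in this post.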

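For nixl it is best to use a prebuilt image, since a plain pip install may run into problems. Unlike mooncake, nixl does not require an ib-device to be set.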

The following demonstrates a PD-disaggregated deployment across different GPUs within a single node:

Start Prefill 0:

export CUDA_VISIBLE_DEVICES=0
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2-1.5B-Instruct \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --disaggregation-mode prefill \
    --disaggregation-bootstrap-port 8990 \
    --disaggregation-transfer-backend mooncake \
    --disaggregation-ib-device "mlx5_0,mlx5_1" \
    --page-size 16 \
    --disable-radix-cache \
    --host 0.0.0.0 \
    --port 30000

Start Prefill 1:

export CUDA_VISIBLE_DEVICES=1
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2-1.5B-Instruct \
    --trust-remote-code \
    --mem-fraction-static 0.85 \
    --disaggregation-mode prefill \
    --disaggregation-bootstrap-port 8991 \
    --disaggregation-transfer-backend mooncake \
    --disaggregation-ib-device "mlx5_0,mlx5_1" \
    --page-size 16 \
    --disable-radix-cache \
    --host 0.0.0.0 \
    --port 30001

Start Decode 0:

export SGLANG_NUM_RESERVED_DECODE_TOKENS=512
export CUDA_VISIBLE_DEVICES=7
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2-1.5B-Instruct \
    --trust-remote-code \
    --cuda-graph-bs 1 2 4 8 16 24 32 40 64 80 96 128 144 160 196 256 \
    --mem-fraction-static 0.85 \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend mooncake \
    --disaggregation-ib-device "mlx5_0,mlx5_1" \
    --page-size 16 \
    --disable-radix-cache \
    --host 0.0.0.0 \
    --port 30007
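
The three launch commands differ only in mode, port, and bootstrap port, so they can be generated from one template. The sketch below is one possible way to do that; `build_pd_launch_cmd` is a hypothetical helper, it uses only flags that appear in the commands above, and it leaves out `--cuda-graph-bs` and the environment variables (`CUDA_VISIBLE_DEVICES`, `SGLANG_NUM_RESERVED_DECODE_TOKENS`), which still have to be set per instance.

```python
import shlex
from typing import List, Optional


def build_pd_launch_cmd(
    mode: str,
    port: int,
    bootstrap_port: Optional[int] = None,
    model_path: str = "Qwen/Qwen2-1.5B-Instruct",
    ib_device: str = "mlx5_0,mlx5_1",
) -> List[str]:
    """Build the argv for one prefill or decode instance."""
    if mode not in ("prefill", "decode"):
        raise ValueError(f"unsupported mode: {mode}")
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--trust-remote-code",
        "--mem-fraction-static", "0.85",
        "--disaggregation-mode", mode,
        "--disaggregation-transfer-backend", "mooncake",
        "--disaggregation-ib-device", ib_device,
        "--page-size", "16",
        "--disable-radix-cache",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    # In the commands above only the prefill instances pass a bootstrap port.
    if mode == "prefill" and bootstrap_port is not None:
        cmd += ["--disaggregation-bootstrap-port", str(bootstrap_port)]
    return cmd


for mode, port, bootstrap in [
    ("prefill", 30000, 8990),
    ("prefill", 30001, 8991),
    ("decode", 30007, None),
]:
    print(shlex.join(build_pd_launch_cmd(mode, port, bootstrap)))
```

Each printed line can be pasted into a shell, or the list can be handed to `subprocess.Popen` together with a per-instance `env` that sets the GPU.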

Setting the environment variable MC_TE_METRIC=true on the prefill instances makes the mooncake transfer engine print its transfer performance.

Start the load balancer

As of 0.4.6.post5 the separator between list values changed from commas to spaces, for example:

python3 -m sglang.srt.disaggregation.mini_lb \
    --port 8000 \
    --prefill http://localhost:30000 http://localhost:30001 \
    --prefill-bootstrap-ports 8990 8991 \
    --decode http://localhost:30007
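
Because each list value is its own space-separated argument, the command maps naturally onto an argv list. The sketch below builds it in Python; `build_mini_lb_cmd` is a hypothetical helper, and the check that prefill URLs and bootstrap ports pair up one-to-one is an assumption inferred from the example above rather than documented behavior.

```python
import shlex
from typing import List


def build_mini_lb_cmd(
    lb_port: int,
    prefill_urls: List[str],
    bootstrap_ports: List[int],
    decode_urls: List[str],
) -> List[str]:
    """Build the argv for mini_lb with space-separated list values."""
    # Assumption: the i-th bootstrap port belongs to the i-th prefill URL.
    if len(prefill_urls) != len(bootstrap_ports):
        raise ValueError("need exactly one bootstrap port per prefill URL")
    return [
        "python3", "-m", "sglang.srt.disaggregation.mini_lb",
        "--port", str(lb_port),
        "--prefill", *prefill_urls,
        "--prefill-bootstrap-ports", *[str(p) for p in bootstrap_ports],
        "--decode", *decode_urls,
    ]


lb_cmd = build_mini_lb_cmd(
    8000,
    ["http://localhost:30000", "http://localhost:30001"],
    [8990, 8991],
    ["http://localhost:30007"],
)
print(shlex.join(lb_cmd))
```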

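Note that prefill and decode instances running on the same node must each use a different port number; instances on different nodes may reuse the same ports.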
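
A quick check before launching can catch port clashes on a node. The sketch below is illustrative only: `find_port_conflicts` and the endpoint names are made up, and the port list mirrors the single-node example in this post (service ports, bootstrap ports, and the load balancer port).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_port_conflicts(
    endpoints: List[Tuple[str, str, int]]
) -> Dict[Tuple[str, int], List[str]]:
    """Return {(host, port): [names]} for every port claimed more than once."""
    users = defaultdict(list)
    for name, host, port in endpoints:
        users[(host, port)].append(name)
    return {key: names for key, names in users.items() if len(names) > 1}


# Ports from the single-node example; all on the same host.
endpoints = [
    ("prefill-0", "localhost", 30000),
    ("prefill-0-bootstrap", "localhost", 8990),
    ("prefill-1", "localhost", 30001),
    ("prefill-1-bootstrap", "localhost", 8991),
    ("decode-0", "localhost", 30007),
    ("mini_lb", "localhost", 8000),
]

conflicts = find_port_conflicts(endpoints)
print(conflicts if conflicts else "no port conflicts")
```

The same port on two different hosts is not reported, which matches the note that ports only need to differ within one node.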

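xPyD (x prefill instances, y decode instances) is supported: list the URLs of the instances after --prefill and --decode and the matching values after --prefill-bootstrap-ports, separated by spaces as of 0.4.6.post5 (earlier versions used commas).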

Check that the deployment works:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2-1.5B-Instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

Here curl talks to the port exposed by mini_lb rather than to an individual prefill or decode instance.
