Local Deployment of the Qwen3.6-35B-A3B-AWQ Model
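1. Model overview
- This is a community AWQ 4-bit, calibration-free (data-free) quantization of Qwen3.6-35B-A3B. At roughly 25 GB, it runs on one or more consumer-grade or mid-range data-center GPUs, which makes it a good fit if you want to deploy the Qwen3.6 MoE locally or privately but cannot afford the VRAM of the original FP8 release. The trade-off is that data-free quantization skips calibration, so accuracy may drop slightly on extreme tasks compared with the original; for everyday agentic coding and multimodal use the difference is barely noticeable.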
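To make the "fits on one or more cards" claim concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the weights are split evenly across GPUs under tensor parallelism and ignores CUDA context and framework overhead, so treat the result as an upper bound on the headroom. The function name `per_gpu_headroom_gb` is made up for this example; the 25 GB figure comes from the description above, while the 16 GB card size and the 0.96 utilization are illustrative inputs.

```python
def per_gpu_headroom_gb(model_gb: float, num_gpus: int,
                        gpu_gb: float, utilization: float) -> float:
    """Estimate memory left per GPU (in GB) for KV cache and activations.

    Assumption: weights are sharded evenly across GPUs; real deployments
    lose some extra memory to the CUDA context and temporary buffers.
    """
    usable_per_gpu = gpu_gb * utilization    # share of each card the server may use
    weights_per_gpu = model_gb / num_gpus    # even split of the quantized weights
    return usable_per_gpu - weights_per_gpu


# Two 16 GB cards, 25 GB of weights, 96% of each card made available.
headroom = per_gpu_headroom_gb(model_gb=25, num_gpus=2, gpu_gb=16, utilization=0.96)
print(f"headroom per GPU: {headroom:.2f} GB")
```

A negative result means the weights alone do not fit, for example a single 24 GB card at 0.9 utilization.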
2. Deployment tool: vLLM
docker pull vllm/vllm-openai:latest
3. Model download
- ModelScope (China-based model community)
https://www.modelscope.cn/models/tclf90/Qwen3.6-35B-A3B-AWQ

- Download the model using the vLLM Docker container
docker run --rm -it \
--gpus all \
--network host \
--entrypoint /bin/bash \
--pids-limit -1 \
--security-opt seccomp=unconfined \
-v /root/lipengcheng/qwen36_35b:/models \
-e OMP_NUM_THREADS=8 \
vllm/vllm-openai:latest \
-c "pip install modelscope && python3 -c \"from modelscope import snapshot_download; snapshot_download('tclf90/Qwen3.6-35B-A3B-AWQ', cache_dir='/models')\""
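Before moving on, it can be worth checking that the download actually completed. The sketch below is a minimal sanity check, assuming the repository ships a `config.json` and `*.safetensors` weight shards (typical for Hugging Face style checkpoints, but an assumption here). The helper name `check_model_dir` is hypothetical.

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems found; an empty list means the folder looks usable."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Assumed layout: a config file plus one or more safetensors shards.
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


# Host-side location of the snapshot, given the volume mount used above.
path = "/root/lipengcheng/qwen36_35b/tclf90/Qwen3___6-35B-A3B-AWQ"
for line in check_model_dir(path) or ["model directory looks complete"]:
    print(line)
```

ModelScope replaces the dots in the model name with `___` in the cache path, which is why the folder is named `Qwen3___6-35B-A3B-AWQ`.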
4. Model deployment (Docker Compose)
version: '3.8'
services:
  vllm-qwen36-moe:
    image: vllm/vllm-openai:latest
    container_name: vllm-qwen3.6-35b-a3b-awq
    privileged: true
    network_mode: "host"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2  # use both T4 GPUs
              capabilities: [gpu]
    volumes:
      # Left side: absolute host path of the downloaded model (as printed by pwd)
      - /root/lipengcheng/qwen36_35b/tclf90/Qwen3___6-35B-A3B-AWQ:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OMP_NUM_THREADS=8
      - VLLM_USE_MODELSCOPE=true
    command: >
      /models
      --host 0.0.0.0
      --port 23333
      --dtype half
      --served-model-name Qwen3.6-35B-A3B-AWQ
      --tensor-parallel-size 2
      --quantization awq_marlin
      --trust-remote-code
      --gpu-memory-utilization 0.96
      --max-model-len 65536
      --max-num-seqs 32
      --enable-prefix-caching
      --enable-chunked-prefill
      --enable-expert-parallel
      --reasoning-parser qwen3
      --tool-call-parser qwen3_coder
      --enable-auto-tool-choice
    restart: unless-stopped
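Loading the weights takes a while, so requests sent right after the container starts will fail. One way to wait is to poll the server until it answers. The sketch below assumes vLLM's OpenAI-compatible server responds with HTTP 200 on `/health` once it is ready; the helper names `http_ok` and `wait_until_ready`, and the retry counts, are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url: str, attempts: int = 60, delay: float = 5.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll url up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

With the port from the Compose file, a call would look like `wait_until_ready("http://localhost:23333/health")`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.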
5. Model testing
- Basic test
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "user", "content": "Who are you?"}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'
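
The same request can be sent from Python using only the standard library. This is a minimal sketch of the basic chat call, with the URL and model name taken from the deployment settings in this post; `build_chat_request` and `chat` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "http://localhost:23333/v1"
MODEL = "Qwen3.6-35B-A3B-AWQ"


def build_chat_request(prompt: str, max_tokens: int = 128,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build the POST request for a single-turn chat completion."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Who are you?")` against a running server returns the reply as a plain string.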

- Streaming test
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "user", "content": "Write quicksort in Python"}
    ],
    "stream": true
  }'
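
With `"stream": true` the server replies with server-sent events: one `data: {...}` line per chunk, terminated by `data: [DONE]`. The sketch below shows one way to stitch the text back together from those lines. The function name `collect_stream_text` is hypothetical, and the sample lines are hand-written stand-ins, not captured server output.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate the delta.content pieces of an OpenAI-style SSE stream."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Hand-written sample events for illustration only.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"def "}}]}',
    'data: {"choices":[{"delta":{"content":"quicksort"}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))
```

In practice the lines would come from iterating over the HTTP response and decoding each one as UTF-8.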

- Long-context & Thinking mode test
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "Derive step by step: if A is faster than B, and B is faster than C, which is faster, A or C?"}
    ]
  }'
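
When a reasoning parser is enabled, the server returns the thinking trace in a separate field from the final answer. The sketch below pulls the two apart from a parsed response. It assumes the trace arrives as `reasoning_content` on the message, with `reasoning` tried as a fallback because the field name may differ between vLLM versions; check the raw JSON from your own server to confirm. `split_reasoning` is a hypothetical helper name.

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a parsed chat completion response.

    Assumption: the reasoning trace is stored under `reasoning_content`
    (or `reasoning`); both come back as empty strings when absent.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning, answer
```

If the reasoning part comes back empty and the answer contains `<think>` tags, the parser is likely not active for that request.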

- Tool calling / agent test (Qwen3 Coder)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "user", "content": "List the files in the current directory"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "list_files",
          "description": "List the files in a directory",
          "parameters": {
            "type": "object",
            "properties": {}
          }
        }
      }
    ]
  }'
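
The server only returns the tool call; executing it is up to the client. The sketch below shows one way to dispatch OpenAI-style `tool_calls` to local Python functions and build the `tool` messages to send back. The local `list_files` implementation and the `run_tool_calls` helper are illustrative, not part of vLLM or of the original setup.

```python
import json
import os


def list_files() -> list[str]:
    """Local implementation of the list_files tool declared in the request."""
    return sorted(os.listdir("."))


# Map tool names from the request to the Python callables that implement them.
TOOLS = {"list_files": list_files}


def run_tool_calls(message: dict) -> list[dict]:
    """Execute each tool call in an assistant message and build the tool replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string, possibly empty.
        args = json.loads(function.get("arguments") or "{}")
        output = TOOLS[function["name"]](**args)
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output, ensure_ascii=False),
        })
    return replies
```

To continue the conversation, append the assistant message and these replies to `messages` and call the endpoint again so the model can phrase the final answer.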

6. Summary
- Generation peaks at 57 tokens/s, which is a respectable speed and enough for everyday copywriting and ordinary code generation.
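For anyone who wants to reproduce a throughput number on their own hardware, here is a minimal sketch that derives tokens per second from the `usage` block of a non-streaming response and the wall-clock time of the request. How the figure above was measured is not stated, so this is only one possible method; it includes prompt processing time and will therefore read slightly lower than pure decode speed. `tokens_per_second` is a hypothetical helper name.

```python
def tokens_per_second(usage: dict, elapsed_seconds: float) -> float:
    """End-to-end generation speed from a response's usage block."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return usage["completion_tokens"] / elapsed_seconds


# Illustrative numbers only: 570 generated tokens in 10 seconds.
print(tokens_per_second({"completion_tokens": 570}, 10.0))
```

Wrap the request in two `time.perf_counter()` calls to obtain `elapsed_seconds`, and use a long `max_tokens` so the measurement is dominated by decoding.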