1. Model overview

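  • This is a community AWQ 4-bit build of Qwen3.6-35B-A3B, quantized data-free (no calibration set), at roughly 25 GB. It runs on one or more consumer-grade or mid-range data-center GPUs, which suits anyone who wants to deploy the Qwen3.6 MoE locally or privately but cannot afford the VRAM of the original FP8 release. The trade-off: with no calibration pass, quality may drop slightly versus the original on extreme tasks, but for everyday agentic coding and multimodal use the difference is barely noticeable.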

2. Deployment tool: vLLM

docker pull vllm/vllm-openai:latest

3. Model download

  • ModelScope (model hub hosted in China)

https://www.modelscope.cn/models/tclf90/Qwen3.6-35B-A3B-AWQ

  • Download the model with the vLLM Docker container
docker run --rm -it \
    --gpus all \
    --network host \
    --entrypoint /bin/bash \
    --pids-limit -1 \
    --security-opt seccomp=unconfined \
    -v /root/lipengcheng/qwen36_35b:/models \
    -e OMP_NUM_THREADS=8 \
    vllm/vllm-openai:latest \
    -c "pip install modelscope && python3 -c \"from modelscope import snapshot_download; snapshot_download('tclf90/Qwen3.6-35B-A3B-AWQ', cache_dir='/models')\""
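
The one-liner above discards the return value of `snapshot_download`, yet that value is the directory that has to be mounted in the compose file in section 4. A minimal sketch of the same download as a script that keeps the path. The `expected_model_dir` helper is hypothetical: it only mirrors the directory name seen in this post (dots replaced by `___`), which is an observation, not documented ModelScope behavior.

```python
import posixpath


def expected_model_dir(cache_dir: str, model_id: str) -> str:
    """Guess where the files land, based on the layout seen in this post.

    Assumption: the on-disk directory name replaces '.' with '___'.
    """
    owner, name = model_id.split("/", 1)
    return posixpath.join(cache_dir, owner, name.replace(".", "___"))


def download(model_id: str, cache_dir: str) -> str:
    """Download the model and return the directory reported by ModelScope."""
    # Imported lazily so the helper above works without modelscope installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)
```

Inside the container, `print(download("tclf90/Qwen3.6-35B-A3B-AWQ", "/models"))` should print the real directory. Trust that value over the guess if the two differ.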

4. Model deployment

version: '3.8'

services:
  vllm-qwen36-moe:
    image: vllm/vllm-openai:latest
    container_name: vllm-qwen3.6-35b-a3b-awq
    privileged: true
    network_mode: "host"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2  # use both T4 cards
              capabilities: [gpu]
    volumes:
      # Host path (left side): absolute path of the downloaded model directory
      - /root/lipengcheng/qwen36_35b/tclf90/Qwen3___6-35B-A3B-AWQ:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OMP_NUM_THREADS=8
      - VLLM_USE_MODELSCOPE=true
    command: >
      /models
      --host 0.0.0.0
      --port 23333
      --dtype half
      --served-model-name Qwen3.6-35B-A3B-AWQ
      --tensor-parallel-size 2
      --quantization awq_marlin
      --trust-remote-code
      --gpu-memory-utilization 0.96
      --max-model-len 65536
      --max-num-seqs 32
      --enable-prefix-caching
      --enable-chunked-prefill
      --enable-expert-parallel
      --reasoning-parser qwen3
      --tool-call-parser qwen3_coder
      --enable-auto-tool-choice
    restart: unless-stopped
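
With `--gpu-memory-utilization 0.96` and `--max-model-len 65536`, it helps to see how little memory is left once the weights are loaded. A rough back-of-the-envelope sketch, assuming a nominal 16 GB per T4 and the roughly 25 GB weight size quoted in section 1. It ignores activations and framework overhead, so the real headroom is smaller.

```python
def kv_cache_budget_gb(num_gpus, gpu_mem_gb, utilization, weights_gb):
    """Rough memory left for the KV cache after the weights are loaded.

    Ignores activations, CUDA graphs, and framework overhead.
    """
    usable = num_gpus * gpu_mem_gb * utilization
    return usable - weights_gb


# Assumptions: 16 GB nominal per T4 and ~25 GB of weights (figure from section 1).
budget = kv_cache_budget_gb(num_gpus=2, gpu_mem_gb=16, utilization=0.96, weights_gb=25)
print(f"{budget:.2f} GB left for KV cache and runtime overhead")
```

Under these assumptions only a few gigabytes remain across both cards. If vLLM reports at startup that the KV cache is too small, lowering `--max-model-len` or `--max-num-seqs` is the first thing to try.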

5. Model testing

  • Basic test
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "user", "content": "Who are you?"}
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }'
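
The same request can be sent from Python with only the standard library, which is handy for scripting later checks. A minimal sketch, assuming the host, port, and served model name from the compose file. The names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:23333/v1"  # host and port from the compose file
MODEL = "Qwen3.6-35B-A3B-AWQ"           # matches --served-model-name


def build_chat_request(prompt, max_tokens=128, temperature=0.7):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Who are you?")` should return the reply text. If the model spends its whole `max_tokens` budget on reasoning, `content` may come back empty, in which case a larger `max_tokens` is worth trying.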

  • Streaming test
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "user", "content": "Write a quicksort in Python"}
    ],
    "stream": true
  }'
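
With `"stream": true` the server answers with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. A small sketch of how a client could stitch the text back together. The sample lines are hand-written to illustrate the OpenAI-style chunk shape, not captured from this deployment.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample chunks for illustration.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "def "}}]}\n',
    b'data: {"choices": [{"delta": {"content": "quicksort"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample)))
```

The response object returned by `urllib.request.urlopen` iterates line by line, so it can be passed to `iter_stream_text` directly.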

  • Long-context & Thinking mode test
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "Reason step by step: if A is faster than B and B is faster than C, which is faster, A or C?"}
    ]
  }'
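
Because the server runs with `--reasoning-parser qwen3`, the reasoning trace should arrive in a separate field from the final answer. A hedged sketch of how to read both. Two assumptions are involved: the field is named `reasoning_content` or `reasoning` depending on the vLLM version, and this model's chat template honors `enable_thinking` the way Qwen3 templates do. Both helper names are illustrative.

```python
def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message dict."""
    # Assumption: the field name varies with the vLLM version, so check both.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning, answer


def thinking_payload(prompt, enable_thinking=True):
    """Request body that toggles thinking through the chat template."""
    return {
        "model": "Qwen3.6-35B-A3B-AWQ",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the chat template reads `enable_thinking`.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
```

If the `/think` system prompt seems to have no effect, sending `thinking_payload(...)` as the request body is an alternative worth trying.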

  • Tool calling / agent test (qwen3_coder parser)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B-AWQ",
    "messages": [
      {"role": "user", "content": "List the files in the current directory for me"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "list_files",
          "description": "List the files in a directory",
          "parameters": {
            "type": "object",
            "properties": {}
          }
        }
      }
    ]
  }'
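
The curl call only shows that the model emits a tool call. An agent loop still has to run the tool and send the result back as a `tool` message. A minimal sketch of that dispatch step, assuming the OpenAI-style `tool_calls` shape. The local `list_files` implementation and the sample message are illustrative, not output from this deployment.

```python
import json
import os


def list_files(path="."):
    """Local implementation behind the `list_files` tool."""
    return sorted(os.listdir(path))


TOOLS = {"list_files": list_files}


def run_tool_calls(message):
    """Execute each tool call and return the `tool` role messages to send back."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn.get("arguments") or "{}")  # arguments arrive as a JSON string
        output = TOOLS[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output, ensure_ascii=False),
        })
    return results


# Hand-written assistant message for illustration; the id is a placeholder.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "list_files", "arguments": "{}"},
    }],
}
replies = run_tool_calls(sample_message)
```

Appending the assistant message and `replies` to `messages`, then calling the endpoint again, lets the model turn the file list into a final answer.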

6. Summary

  • The model peaked at about 57 tokens/s in this setup, which is a decent speed and enough for everyday copywriting and routine code generation.
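
To reproduce a tokens/s figure on your own hardware, one simple approach is to divide `usage.completion_tokens` by the wall-clock time of a non-streaming request. A rough sketch with illustrative helper names. The elapsed time includes prompt processing, so this slightly understates pure decode speed.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Throughput of a single request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed(send_request):
    """Call `send_request()` (must return the parsed JSON response) and time it."""
    start = time.perf_counter()
    response = send_request()
    elapsed = time.perf_counter() - start
    return tokens_per_second(response["usage"]["completion_tokens"], elapsed)
```

Averaging over several prompts with long outputs gives a steadier number than a single short reply.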
