vllm 加速推理通义千问Qwen经验总结

curl请求 curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen-14B-Chat-Int4-hf", "prompt": "San Francisco is a", "max_tokens": 7, "temperature": 0

文章共646字 · 阅读需要大约3分钟

一键AI生成摘要，助你高效阅读

问答

Charles_yy

4556人浏览 · 2023-12-21 15:31:00

Charles_yy · 2023-12-21 15:31:00 发布

1. 简介

1.1. 功能说明

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels

vLLM is flexible and easy to use with:

- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

1.2. GitHub项目

原始项目：https://github.com/vllm-project/vllm （支持awq量化，暂不支持gptq量化）

拓展项目：https://github.com/chu-tianxiang/vllm-gptq （支持gptq量化）

2. 架构

官方文档：vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog

中文文档：vLLM框架原理——PagedAttention - 知乎

2.1. 官方测试性能

2.2. 主要功能说明

2.2.1. PagedAttention

问题：由于碎片化和过度预留，现有的系统浪费了60%-80%的内存

解决方法：将序列中token按固定长度划分为多个块，并与系统内存进行映射，解决不连续填满问题。

亮点：高效内存利用和共享

高效内存利用示意图：

physical blocks 是横坐标，filled slots是纵坐标

高效内存共享示意图：

3. 实现方案

如果要用gptq量化需要用：https://github.com/chu-tianxiang/vllm-gptq

pip install -e .

3.1. 离线模式

pip install vllm

from vllm import LLM

prompts = ["Hello, my name is", "The capital of France is"]  # Sample prompts.
llm = LLM(model="/data/jupyter/LLM/models/Qwen-14B-Chat-Int4-hf", 
          trust_remote_code=True, 
          quantization="gptq",
          gpu_memory_utilization=0.5,
         )  # Create an LLM.
outputs = llm.generate(prompts)  # Generate texts from the prompts.

3.2. 服务模型

# 指定模型名称或模型路径
CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server \
--model /data/jupyter/LLM/models/Qwen-14B-Chat-Int4-hf/ \
--trust-remote-code \
--port 30003 \
--gpu-memory-utilization  0.5 \
--tensor-parallel-size 1 \
--served-model-name Qwen/Qwen-14B-Chat-Int4-hf \
--quantization gptq

# curl请求
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen-14B-Chat-Int4-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'