vLLM 项目运行指南

uv极快：比 pip 快 10-100 倍可靠：依赖解析更准确现代：支持 Python 3.12+ 的最新特性兼容：完全兼容 pip 和 requirements.txt步骤命令耗时状态安装 uv~10s✅创建虚拟环境~30s✅安装 lint 工具~20s✅安装 vLLM~10min✅安装测试依赖~5min✅验证安装~5s✅运行测试~2min✅总耗时：约 20 分钟最终状态：✅ 安装成功，所有核心

lqqjuly

232人浏览 · 2026-05-18 14:07:33

lqqjuly · 2026-05-18 14:07:33 发布

本文档详细记录了 vLLM 项目的完整安装和运行过程

1. 环境信息

1.1 硬件环境

组件	规格
GPU	4 × NVIDIA GeForce RTX 4090 (24GB)
GPU 架构	Ada Lovelace (Compute Capability 8.9)
CUDA 版本	12.4 (驱动) / 12.9 (PyTorch)
操作系统	Linux 6.8.0-111-generic

1.2 软件环境

组件	版本
Python	3.12.13 (通过 uv 安装)
PyTorch	2.11.0+cu129
vLLM	0.20.2rc1.dev372 (开发版)
uv	0.11.14

1.3 检查命令

# 检查 GPU 信息
nvidia-smi

# 检查 Python 版本
python3 --version

# 检查 CUDA 版本
nvcc --version

原因：在开始安装前，需要确认硬件和软件环境是否满足 vLLM 的要求。vLLM 需要 NVIDIA GPU 和 CUDA 支持。

2. 安装 uv 包管理器

2.1 什么是 uv？

uv 是 Astral 公司开发的 Python 包管理器，特点：

极快：比 pip 快 10-100 倍
可靠：依赖解析更准确
现代：支持 Python 3.12+ 的最新特性
兼容：完全兼容 pip 和 requirements.txt

2.2 安装命令

curl -LsSf https://astral.sh/uv/install.sh | sh

命令解析：

curl -LsSf：下载安装脚本
- -L：跟随重定向
- -s：静默模式
- -S：显示错误
- -f：失败时返回错误码
| sh：将下载的脚本传递给 shell 执行

2.3 配置 PATH

source $HOME/.local/bin/env

原因：uv 安装到 ~/.local/bin，需要将其添加到 PATH 环境变量中。

2.4 验证安装

uv --version

预期输出：

uv 0.11.14

3. 创建 Python 虚拟环境

3.1 为什么使用虚拟环境？

隔离性：避免不同项目的依赖冲突
可重复性：确保环境一致
安全性：不会污染系统 Python

3.2 创建命令

uv venv --python 3.12

命令解析：

uv venv：创建虚拟环境
--python 3.12：指定 Python 版本为 3.12

原因：vLLM 官方推荐使用 Python 3.12，因为：

性能更好
支持最新的类型注解语法
与 PyTorch 2.11 兼容性最佳

3.3 激活虚拟环境

source .venv/bin/activate

原因：激活后，后续的 pip/python 命令都会使用虚拟环境中的版本。

3.4 验证环境

which python
python --version

预期输出：

/home/ms/lm/code/vllm/.venv/bin/python
Python 3.12.13

4. 安装代码规范工具

4.1 安装 lint 依赖

uv pip install -r requirements/lint.txt

命令解析：

uv pip install：使用 uv 安装 Python 包
-r requirements/lint.txt：从文件读取依赖列表

requirements/lint.txt 包含：

pre-commit：Git hooks 管理工具
ruff：Python linter（替代 flake8、isort、black）
mypy：静态类型检查

4.2 安装 pre-commit hooks

pre-commit install

原因：pre-commit hooks 会在每次 git commit 前自动运行代码检查，确保代码质量。

安装的 hooks：

pre-commit：提交前检查
commit-msg：提交信息格式检查

4.3 验证安装

pre-commit --version

5. 安装 vLLM 主库

5.1 安装命令

VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=auto

命令解析：

部分	含义
`VLLM_USE_PRECOMPILED=1`	使用预编译的 C 扩展，避免从源码编译
`uv pip install`	使用 uv 安装
`-e .`	以可编辑模式安装当前目录（开发模式）
`--torch-backend=auto`	自动选择 PyTorch 后端（CUDA/CPU/ROCm）

5.2 为什么使用预编译？

预编译模式 (VLLM_USE_PRECOMPILED=1)：

安装速度快（几分钟）
不需要编译环境（CUDA toolkit、gcc 等）
包含所有必要的 C 扩展
⚠️ 不能修改 C/C++ 代码

源码编译模式：

可以修改 C/C++ 代码
⚠️ 安装时间长（可能需要 30+ 分钟）
⚠️ 需要完整的编译环境

5.3 安装过程

安装过程中会下载以下关键组件：

组件	大小	说明
`torch`	~2GB	PyTorch 深度学习框架
`vllm`	~100MB	vLLM Python 代码
`_C.abi3.so`	431.7MB	核心 C 扩展
`_moe_C.abi3.so`	285.2MB	MoE（混合专家）内核
`_C_stable_libtorch.abi3.so`	64.7MB	LibTorch 绑定
`_flashmla_C.abi3.so`	6.3MB	FlashMLA 内核
`flashinfer-python`	~300MB	FlashInfer 注意力后端

5.4 验证安装

python -c "import vllm; print(vllm.__version__)"

预期输出：

0.20.2rc1.dev372+ge30f39c4f

6. 安装测试依赖

6.1 安装命令

uv pip install -r requirements/test/cuda.in

原因：测试依赖包含 pytest、测试工具和模拟数据等，用于验证安装是否正确。

6.2 测试依赖包含

包	用途
`pytest`	测试框架
`pytest-asyncio`	异步测试支持
`pytest-timeout`	测试超时控制
`hypothesis`	属性基测试
`ray`	分布式计算框架
`transformers`	HuggingFace 模型库

7. 验证安装

7.1 核心模块导入测试

.venv/bin/python -c "
import torch
print('=== vLLM Installation Verification ===')
print()

# 1. PyTorch + CUDA
print('1. PyTorch + CUDA:')
print(f'   PyTorch: {torch.__version__}')
print(f'   CUDA: {torch.version.cuda}')
print(f'   GPU count: {torch.cuda.device_count()}')
print()

# 2. vLLM version
import vllm
print(f'2. vLLM version: {vllm.__version__}')
print()

# 3. Core imports
print('3. Core module imports:')
modules = [
    ('vllm.LLM', 'vllm', 'LLM'),
    ('vllm.SamplingParams', 'vllm', 'SamplingParams'),
    ('vllm.config.VllmConfig', 'vllm.config', 'VllmConfig'),
    ('vllm.engine.arg_utils.EngineArgs', 'vllm.engine.arg_utils', 'EngineArgs'),
    ('vllm.sampling_params.SamplingParams', 'vllm.sampling_params', 'SamplingParams'),
    ('vllm.model_executor', 'vllm.model_executor', None),
    ('vllm._custom_ops', 'vllm._custom_ops', None),
]

for name, module, attr in modules:
    try:
        mod = __import__(module, fromlist=[attr] if attr else [])
        if attr:
            getattr(mod, attr)
        print(f'   {name}: OK')
    except Exception as e:
        print(f'   {name}: FAILED ({e})')
print()

# 4. C extensions
print('4. C extensions (.so files):')
import os
so_files = [f for f in os.listdir('vllm') if f.endswith('.so')]
for f in so_files:
    size_mb = os.path.getsize(f'vllm/{f}') / (1024*1024)
    print(f'   {f}: {size_mb:.1f} MB')
print()

# 5. CUDA ops test
print('5. CUDA operations:')
x = torch.randn(512, 512, device='cuda')
y = torch.randn(512, 512, device='cuda')
z = torch.mm(x, y)
print(f'   Matrix multiplication: {z.shape} - OK')
print()

print('=== Installation is complete and functional! ===')
"

预期输出：

=== vLLM Installation Verification ===

1. PyTorch + CUDA:
   PyTorch: 2.11.0+cu129
   CUDA: 12.9
   GPU count: 4

2. vLLM version: 0.20.2rc1.dev372+ge30f39c4f

3. Core module imports:
   vllm.LLM: OK
   vllm.SamplingParams: OK
   vllm.config.VllmConfig: OK
   vllm.engine.arg_utils.EngineArgs: OK
   vllm.sampling_params.SamplingParams: OK
   vllm.model_executor: OK
   vllm._custom_ops: OK

4. C extensions (.so files):
   _flashmla_extension_C.abi3.so: 0.8 MB
   _C_stable_libtorch.abi3.so: 64.7 MB
   _moe_C.abi3.so: 285.2 MB
   cumem_allocator.abi3.so: 0.0 MB
   _flashmla_C.abi3.so: 6.3 MB
   _C.abi3.so: 431.7 MB
   spinloop.abi3.so: 0.0 MB

5. CUDA operations:
   Matrix multiplication: torch.Size([512, 512]) - OK

=== Installation is complete and functional! ===

8. 运行单元测试

8.1 运行离线测试

.venv/bin/python -m pytest tests/test_logger.py tests/test_scalartype.py tests/test_sequence.py -v --timeout=60

命令解析：

.venv/bin/python：使用虚拟环境中的 Python
-m pytest：运行 pytest 模块
tests/test_logger.py：指定测试文件
-v：详细输出
--timeout=60：每个测试最多运行 60 秒

8.2 测试结果

============================= test session starts ==============================
platform linux -- Python 3.12.13, pytest-9.0.3
collected 35 items

tests/test_logger.py::test_trace_function_call PASSED                    [  2%]
tests/test_logger.py::test_default_vllm_root_logger_configuration PASSED [  5%]
tests/test_logger.py::test_descendent_loggers_depend_on_and_propagate_logs_to_root_logger PASSED [  8%]
tests/test_logger.py::test_logger_configuring_can_be_disabled PASSED     [ 11%]
tests/test_logger.py::test_an_error_is_raised_when_custom_logging_config_file_does_not_exist PASSED [ 14%]
tests/test_logger.py::test_an_error_is_raised_when_custom_logging_config_is_invalid_json PASSED [ 17%]
tests/test_logger.py::test_an_error_is_raised_when_custom_logging_config_is_unexpected_json[Invalid string] PASSED [ 20%]
tests/test_logger.py::test_an_error_is_raised_when_custom_logging_config_is_unexpected_json[unexpected_config1] PASSED [ 22%]
tests/test_logger.py::test_an_error_is_raised_when_custom_logging_config_is_unexpected_json[0] PASSED [ 25%]
tests/test_logger.py::test_custom_logging_config_is_parsed_and_used_when_provided PASSED [ 28%]
tests/test_logger.py::test_custom_logging_config_causes_an_error_if_configure_logging_is_off PASSED [ 31%]
tests/test_logger.py::test_prepare_object_to_dump PASSED                 [ 34%]
tests/test_logger.py::test_request_logger_log_outputs PASSED             [ 37%]
tests/test_logger.py::test_request_logger_log_outputs_streaming_delta PASSED [ 40%]
tests/test_logger.py::test_request_logger_log_outputs_streaming_complete PASSED [ 42%]
tests/test_logger.py::test_request_logger_log_outputs_with_truncation PASSED [ 45%]
tests/test_logger.py::test_request_logger_log_outputs_none_values PASSED [ 48%]
tests/test_logger.py::test_request_logger_log_outputs_empty_output PASSED [ 51%]
tests/test_logger.py::test_request_logger_log_outputs_integration PASSED [ 54%]
tests/test_logger.py::test_streaming_complete_logs_full_text_content PASSED [ 57%]
tests/test_logger.py::test_caplog_mp_fork PASSED                         [ 60%]
tests/test_logger.py::test_caplog_mp_spawn PASSED                        [ 62%]
tests/test_scalartype.py::test_scalar_type_min_max[(-8, 7, ScalarType.int4)] PASSED [ 65%]
tests/test_scalartype.py::test_scalar_type_min_max[(0, 15, ScalarType.uint4)] PASSED [ 68%]
tests/test_scalartype.py::test_scalar_type_min_max[(-8, 7, ScalarType.uint4b8)] PASSED [ 71%]
tests/test_scalartype.py::test_scalar_type_min_max[(-128, 127, ScalarType.uint8b128)] PASSED [ 74%]
tests/test_scalartype.py::test_scalar_type_min_max[(-6.0, 6.0, ScalarType.float4_e2m1f)] PASSED [ 77%]
tests/test_scalartype.py::test_scalar_type_min_max[(-28.0, 28.0, ScalarType.float6_e3m2f)] PASSED [ 80%]
tests/test_scalartype.py::test_scalar_type_min_max[(torch.int8, ScalarType.int8)] PASSED [ 82%]
tests/test_scalartype.py::test_scalar_type_min_max[(torch.uint8, ScalarType.uint8)] PASSED [ 85%]
tests/test_scalartype.py::test_scalar_type_min_max[(torch.float8_e5m2, ScalarType.float8_e5m2n)] PASSED [ 88%]
tests/test_scalartype.py::test_scalar_type_min_max[(torch.float8_e4m3fn, ScalarType.float8_e4m3fn)] PASSED [ 91%]
tests/test_scalartype.py::test_scalar_type_min_max[(torch.bfloat16, ScalarType.float16_e8m7n)] PASSED [ 94%]
tests/test_scalartype.py::test_scalar_type_min_max[(torch.float16, ScalarType.float16_e5m10n)] PASSED [ 97%]
tests/test_sequence.py::test_sequence_intermediate_tensors_equal PASSED  [100%]

======================== 35 passed, 17 warnings in 19.33s ========================

8.3 运行更多测试

.venv/bin/python -m pytest tests/test_envs.py tests/test_vllm_port.py tests/test_access_log_filter.py tests/test_jit_monitor.py tests/test_embedded_commit.py tests/test_seed_behavior.py -v --timeout=30

测试结果：

======================== 88 passed, 1 failed, 16 warnings in 24.90s ========================

失败的测试：

test_seed_behavior.py：Platform.seed_everything 方法不存在（API 变化，非安装问题）

9. CUDA 性能测试

9.1 测试命令

.venv/bin/python -c "
import torch
import time

print('=== CUDA Performance Test ===')
print()

# 测试 GPU 信息
print('GPU Information:')
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f'  GPU {i}: {props.name}')
    print(f'    Memory: {props.total_memory / 1024**3:.1f} GB')
    print(f'    Compute Capability: {props.major}.{props.minor}')
print()

# 测试矩阵运算性能
print('Matrix Multiplication Performance:')
sizes = [1024, 2048, 4096]
for size in sizes:
    a = torch.randn(size, size, device='cuda')
    b = torch.randn(size, size, device='cuda')
    
    # Warmup
    for _ in range(10):
        c = torch.mm(a, b)
    torch.cuda.synchronize()
    
    # Benchmark
    start = time.time()
    for _ in range(100):
        c = torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    
    print(f'  {size}x{size}: {elapsed*1000/100:.2f} ms per operation')
print()

print('=== All tests passed! vLLM installation is complete. ===')
"

9.2 测试结果

=== CUDA Performance Test ===

GPU Information:
  GPU 0: NVIDIA GeForce RTX 4090
    Memory: 23.6 GB
    Compute Capability: 8.9
  GPU 1: NVIDIA GeForce RTX 4090
    Memory: 23.6 GB
    Compute Capability: 8.9
  GPU 2: NVIDIA GeForce RTX 4090
    Memory: 23.6 GB
    Compute Capability: 8.9
  GPU 3: NVIDIA GeForce RTX 4090
    Memory: 23.6 GB
    Compute Capability: 8.9

Matrix Multiplication Performance:
  1024x1024: 0.05 ms per operation
  2048x2048: 0.33 ms per operation
  4096x4096: 2.64 ms per operation

=== All tests passed! vLLM installation is complete. ===

10. 常见问题与解决方案

10.1 网络问题

问题：无法连接到 huggingface.co

症状：

'[Errno 101] Network is unreachable' thrown while requesting HEAD https://huggingface.co/...

解决方案：

检查网络连接

配置代理：

export HTTP_PROXY=http://proxy:port
export HTTPS_PROXY=http://proxy:port

使用离线模式：
```
export HF_HUB_OFFLINE=1
```

预下载模型到本地：

huggingface-cli download meta-llama/Llama-3-8B-Instruct --local-dir ./models/Llama-3-8B-Instruct

10.2 CUDA 版本不匹配

问题：PyTorch CUDA 版本与系统 CUDA 版本不匹配

解决方案：

# 检查系统 CUDA 版本
nvcc --version

# 检查 PyTorch CUDA 版本
python -c "import torch; print(torch.version.cuda)"

# 如果不匹配，重新安装 PyTorch
uv pip install torch --torch-backend=auto

10.3 内存不足

问题：GPU 内存不足

解决方案：

from vllm import LLM

# 降低 GPU 内存使用率
llm = LLM(model="model-name", gpu_memory_utilization=0.8)

# 使用量化模型
llm = LLM(model="model-name", quantization="awq")

10.4 预编译包不包含最新代码

问题：修改了 C/C++ 代码后，预编译包不生效

解决方案：

# 卸载预编译包
uv pip uninstall vllm

# 从源码编译安装
uv pip install -e . --torch-backend=auto

11. 完整安装脚本

以下是一个完整的安装脚本，可以直接复制执行：

#!/bin/bash
set -e

echo "=== Step 1: Install uv ==="
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

echo "=== Step 2: Create virtual environment ==="
uv venv --python 3.12
source .venv/bin/activate

echo "=== Step 3: Install lint tools ==="
uv pip install -r requirements/lint.txt
pre-commit install

echo "=== Step 4: Install vLLM ==="
VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=auto

echo "=== Step 5: Install test dependencies ==="
uv pip install -r requirements/test/cuda.in

echo "=== Step 6: Verify installation ==="
python -c "
import vllm
import torch
print(f'vLLM version: {vllm.__version__}')
print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print('Installation complete!')
"

echo "=== Step 7: Run tests ==="
python -m pytest tests/test_logger.py tests/test_scalartype.py -v --timeout=60

echo "=== Installation complete! ==="

12. 后续步骤

12.1 运行推理（需要网络）

from vllm import LLM, SamplingParams

# 加载模型
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")

# 生成文本
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=50)
)
print(outputs[0].outputs[0].text)

12.2 启动 API 服务

# 启动 OpenAI 兼容的 API 服务
vllm serve meta-llama/Llama-3-8B-Instruct --port 8000

# 使用 curl 测试
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'

12.3 使用本地模型

# 下载模型
huggingface-cli download meta-llama/Llama-3-8B-Instruct --local-dir ./models/Llama-3-8B-Instruct

# 使用本地模型
vllm serve ./models/Llama-3-8B-Instruct --port 8000

13. 附录

13.1 关键文件说明

文件/目录	说明
`vllm/`	主库代码
`vllm/v1/`	v1 引擎实现
`vllm/v1/core/`	调度器和 KV Cache 管理
`vllm/v1/attention/`	注意力后端实现
`vllm/model_executor/`	模型执行器
`vllm/entrypoints/`	API 入口点
`tests/`	测试代码
`requirements/`	依赖文件
`setup.py`	安装脚本
`pyproject.toml`	项目配置

13.2 环境变量

变量	说明
`VLLM_USE_PRECOMPILED=1`	使用预编译包
`VLLM_WORKER_MULTIPROC_METHOD=spawn`	多进程启动方式
`CUDA_VISIBLE_DEVICES=0,1`	可见 GPU
`HF_HUB_OFFLINE=1`	HuggingFace 离线模式
`HF_TOKEN=xxx`	HuggingFace 访问令牌

13.3 性能调优

from vllm import LLM

llm = LLM(
    model="model-name",
    gpu_memory_utilization=0.9,  # GPU 内存使用率
    max_num_batched_tokens=8192,  # 最大批处理 token 数
    max_num_seqs=256,  # 最大并发序列数
    tensor_parallel_size=4,  # 张量并行数
    pipeline_parallel_size=1,  # 流水线并行数
)

总结

步骤	命令	耗时	状态
安装 uv	`curl -LsSf https://astral.sh/uv/install.sh \| sh`	~10s	✅
创建虚拟环境	`uv venv --python 3.12`	~30s	✅
安装 lint 工具	`uv pip install -r requirements/lint.txt`	~20s	✅
安装 vLLM	`VLLM_USE_PRECOMPILED=1 uv pip install -e .`	~10min	✅
安装测试依赖	`uv pip install -r requirements/test/cuda.in`	~5min	✅
验证安装	`python -c "import vllm"`	~5s	✅
运行测试	`pytest tests/`	~2min	✅

总耗时：约 20 分钟

最终状态：✅ 安装成功，所有核心功能正常

加入AMD AI开发者计划！

免费领 50 小时云算力，进群参与显卡、AI PC 幸运抽奖

更多推荐

PyTorch CUDA设备不可用错误解决方案

本文提供了PyTorch CUDA设备不可用错误的全面解决方案。主要内容包括：1) 快速解决方法（强制加载到CPU）；2) 根本解决方案（检查修复GPU环境）；3) 最佳实践（使用state_dict方式）。文章还涵盖特殊设备处理（AMD显卡和Apple M芯片）、完整解决方案代码、预防措施和常见问题排查。关键建议是使用state_dict方式保存和加载模型，以确保最佳兼容性。适用于遇到PyTor

AMD开发者中国社区

企业AI中台崩溃前夜（2026工具栈兼容性危机实录）：Python 3.13/PyTorch 2.6/Triton 3.0协同失效全复盘

2026年AI工具栈搭建完整指南：直击Python 3.13、PyTorch 2.6与Triton 3.0协同失效痛点，提供企业AI中台兼容性修复方案、版本降级策略及容器化隔离实践。覆盖模型训练、推理部署全链路，保障生产环境稳定性，值得收藏。

AMD开发者中国社区

Docker部署Ollama模型

对于支持事务的 Sink（如文件系统、Iceberg），需要一个全局的 Committer 来在 Checkpoint 完成时统一提交事务（二阶段提交），从而实现 Exactly-Once（精确一次）语义。支持列表: MySQL-CDC, PostgreSQL-CDC, Oracle-CDC, MongoDB-CDC, SQLServer-CDC, TiDB-CDC 等。无状态转换：大多数 Tra

AMD开发者中国社区

所有评论(0)

查看更多评论

lqqjuly

@qq_45998729

已为社区贡献1条内容

vLLM 项目运行指南

lqqjuly

目录

1. 环境信息

1.1 硬件环境

1.2 软件环境

1.3 检查命令

2. 安装 uv 包管理器

2.1 什么是 uv？

2.2 安装命令

2.3 配置 PATH

2.4 验证安装

3. 创建 Python 虚拟环境

3.1 为什么使用虚拟环境？

3.2 创建命令

3.3 激活虚拟环境

3.4 验证环境

4. 安装代码规范工具

4.1 安装 lint 依赖

4.2 安装 pre-commit hooks

4.3 验证安装

5. 安装 vLLM 主库

5.1 安装命令

5.2 为什么使用预编译？

5.3 安装过程

5.4 验证安装

6. 安装测试依赖

6.1 安装命令

6.2 测试依赖包含

7. 验证安装

7.1 核心模块导入测试

8. 运行单元测试

8.1 运行离线测试

8.2 测试结果

8.3 运行更多测试

9. CUDA 性能测试

9.1 测试命令

9.2 测试结果

10. 常见问题与解决方案

10.1 网络问题

10.2 CUDA 版本不匹配

10.3 内存不足

10.4 预编译包不包含最新代码

11. 完整安装脚本

12. 后续步骤

12.1 运行推理（需要网络）

12.2 启动 API 服务

12.3 使用本地模型

13. 附录

13.1 关键文件说明

13.2 环境变量

13.3 性能调优

总结

所有评论(0)

温馨提示：您尚未绑定手机号

lqqjuly