vLLM: Calling vLLM in Offline Mode (Beginner Tutorial)

HY_will

HY_will · Published 2026-05-14 18:27:54

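The previous post covered how to install vLLM on a CPU. This one covers how to call it.

According to the official documentation, the demo code for calling vLLM falls into two categories.

Type 1: call vLLM directly from inside Python code, i.e. offline mode.

Type 2: call vLLM over HTTP, i.e. online mode.

Once you are comfortable with the second approach, you can build a simple LLM application.

Today we cover only offline mode, and the code is simple.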

Offline mode

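Put simply, offline mode means calling the Python interface that the vLLM framework provides directly from your code. There is no separate model inference service to start; the model effectively runs inside your own program.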

Run the first example:

In your vLLM source directory, run:

python examples/basic/offline_inference/basic.py

Note: be sure to activate your uv environment first: source .venv/bin/activate (replace with your own path)
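If the example fails with an import error, the usual cause is that the wrong interpreter is active. The following is a minimal sketch for checking that, using only the Python standard library. The function name check_vllm_environment is made up for this illustration and is not part of vLLM.

```python
import importlib.util
import sys


def check_vllm_environment():
    """Report the active interpreter and whether vllm is importable from it."""
    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    in_venv = sys.prefix != sys.base_prefix
    # find_spec returns None when the top-level package cannot be located.
    vllm_found = importlib.util.find_spec("vllm") is not None
    return {
        "python": sys.executable,
        "in_virtualenv": in_venv,
        "vllm_installed": vllm_found,
    }


if __name__ == "__main__":
    for key, value in check_vllm_environment().items():
        print(f"{key}: {value}")
```

If in_virtualenv or vllm_installed comes back False, activate the environment and try again before running the example.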

If the script prints each prompt followed by its generated output, the run succeeded.

Now let's look at the code:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


def main():
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt:    {prompt!r}")
        print(f"Output:    {generated_text!r}")
        print("-" * 60)


if __name__ == "__main__":
    main()

Line-by-line walkthrough:

from vllm import LLM, SamplingParams

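Imports vLLM's LLM inference class and its sampling-parameters class. This is analogous to including a header file in C++: after this line, the code can use the LLM and SamplingParams classes.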

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

Declares a list of prompts to use as the model's input.

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

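Creates the sampling-parameters object, which controls how "creative" and how random the model's output is.

temperature: controls the randomness of generation. Lower values make the output more deterministic; higher values make it more varied.

top_p: restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches 0.95. Informally, it limits the pool of candidate tokens.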
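To make the two parameters concrete, here is a small pure-Python sketch of the idea behind them. It is a conceptual illustration only and not vLLM's actual implementation; the four-token vocabulary, the scores, and the helper names softmax_with_temperature and top_p_filter are all made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities (assumes temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving tokens sum to 1.
    return {i: probs[i] / total for i in kept}


# Hypothetical scores for a four-token vocabulary (illustration only).
vocab = ["Paris", "London", "a", "banana"]
logits = [4.0, 2.0, 1.0, -2.0]

for temperature in (0.2, 0.8, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    kept = top_p_filter(probs, top_p=0.95)
    print(temperature, [vocab[i] for i in kept])
```

Running it shows the candidate pool growing as the temperature rises: a higher temperature flattens the distribution, so more tokens are needed to reach the 0.95 cumulative threshold.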

llm = LLM(model="facebook/opt-125m")

Creates the LLM inference instance (using the facebook/opt-125m model).

outputs = llm.generate(prompts, sampling_params)

Asks the model to generate a continuation for each prompt, using the sampling parameters defined earlier.

for output in outputs:
    prompt = output.prompt  # the input prompt
    generated_text = output.outputs[0].text  # the text generated by the model
    print(f"Prompt:    {prompt!r}")
    print(f"Output:    {generated_text!r}")
    print("-" * 60)

Iterates over the results and prints each prompt together with its generated text.
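Note that output.outputs is a list, which is why the loop indexes it with [0]: a single prompt can carry more than one completion (for example, when SamplingParams is created with n greater than 1). The sketch below shows that shape without loading a model. FakeCompletion, FakeRequestOutput, and collect_results are hypothetical stand-ins written for this illustration; they mimic only the prompt, outputs, and text attributes used in the loop, while the real objects returned by llm.generate carry more fields.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins that mimic only the attributes used in the loop.
@dataclass
class FakeCompletion:
    text: str


@dataclass
class FakeRequestOutput:
    prompt: str
    outputs: list = field(default_factory=list)


def collect_results(request_outputs):
    """Map each prompt to the list of texts generated for it."""
    return {ro.prompt: [c.text for c in ro.outputs] for ro in request_outputs}


# Made-up data: one prompt with a single completion, one with two.
fake_outputs = [
    FakeRequestOutput("The capital of France is", [FakeCompletion(" Paris.")]),
    FakeRequestOutput(
        "Hello, my name is",
        [FakeCompletion(" Ada."), FakeCompletion(" Bob.")],
    ),
]

for prompt, texts in collect_results(fake_outputs).items():
    print(f"{prompt!r} -> {texts!r}")
```

Because collect_results relies only on those three attribute names, the same function should also work on the list returned by llm.generate in the example.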
