How GPU Offload in LM Studio Affects Token Generation Speed: A Case Study on the 780M iGPU of the 7840HS

虎鲸不是鱼 · Published 2026-05-25 11:40:34

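Background

A year ago I verified that running a 32B model on an integrated GPU is feasible: https://lizhiyong.blog.csdn.net/article/details/147237895

I hear that smarter small models have come out recently, so this is a good chance to check how to raise the token generation speed on an aging machine.

One key parameter for that is GPU offload.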

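Default settings

By default, LM Studio sets GPU offload to 4:

[Screenshot]

Chat

虎鲸是鱼嘛？ (Is an orca a fish?)

[Screenshot]

At this point the iGPU load is low while the CPU load is high. Most of the work is clearly landing on the CPU, so token generation is slower than I would like: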
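To make the CPU-heavy split concrete, here is a minimal sketch of how the layers divide at this setting. It assumes the GPU offload value counts model layers, that the model has 64 layers in total (inferred from the maximum value of 64 set later in this post), and that every layer costs about the same. The names `offload_split` and `TOTAL_LAYERS` are made up for illustration and are not part of LM Studio.

```python
def offload_split(gpu_layers: int, total_layers: int) -> tuple[float, float]:
    """Return (gpu_share, cpu_share) of the layers as fractions of the total."""
    if not 0 <= gpu_layers <= total_layers:
        raise ValueError("gpu_layers must be between 0 and total_layers")
    gpu_share = gpu_layers / total_layers
    return gpu_share, 1.0 - gpu_share


# ASSUMPTION: 64 total layers, inferred from the maximum offload value in this post.
TOTAL_LAYERS = 64

gpu_share, cpu_share = offload_split(4, TOTAL_LAYERS)
print(f"GPU: {gpu_share:.2%}, CPU: {cpu_share:.2%}")  # GPU: 6.25%, CPU: 93.75%
```

Under those assumptions, only about one layer in sixteen runs on the iGPU, which is consistent with the high CPU load seen in the screenshot.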

Think

Here's a thinking process:

Analyze User Input:
Question: "虎鲸是鱼嘛？" (Are orcas/fish?)
Language: Chinese
Core question: Classification of 虎鲸 (orca/killer whale) - is it a fish?
Identify Key Facts:
虎鲸 (Orca/Killer Whale) is a marine mammal.
It belongs to the order Cetacea, family Delphinidae.
Characteristics: breathes air with lungs, gives birth to live young, nurses them with milk, maintains constant body temperature, has hair (at least at birth).
Fish are cold-blooded, breathe through gills, lay eggs (mostly), have scales.
Therefore, 虎鲸 is NOT a fish.
Formulate Response (Internal Refinement - Chinese):
Direct answer: 不是。
Explanation: 虎鲸是哺乳动物，属于鲸目海豚科。它用肺呼吸、胎生、哺乳、恒温，不具备鱼类特征（如鳃呼吸、卵生、变温等）。虽然生活在海洋里且外形似鱼，但分类学上明确属于哺乳类。
Keep it concise as per system prompt.
Check against Constraints:
Plain, complete, no moralizing/disclaimers/hedging.
Follow exact format if requested (none here).
Concise by default.
Language matches user (Chinese).
Draft: 不是。虎鲸是哺乳动物，属于鲸目海豚科。它用肺呼吸、胎生、哺乳、恒温，不具备鱼类的鳃呼吸和卵生等特征。虽然生活在海洋且外形似鱼，但分类学上明确属于哺乳类。

Final Output Generation: (matches draft)✅
Direct, accurate, concise, matches all constraints.✅
No extra fluff.✅
Ready.
Output matches the refined draft.
Proceeds.
[Done]
Self-Correction/Verification during thought:
Is it a fish? No.
What is it? Marine mammal (cetacean).
Key differentiators: lungs, live birth, milk, warm-blooded vs gills, eggs, cold-blooded.
All correct. Output matches.✅
Output generation.
“不是。虎鲸是哺乳动物，属于鲸目海豚科。它用肺呼吸、胎生、哺乳、恒温，不具备鱼类用鳃呼吸和卵生等特征。虽生活在海洋且外形似鱼，但分类学上明确属于哺乳类。” (Slightly polished for flow) -> matches exactly.✅
Done.
Output matches response.
[Proceeds]

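[Screenshot]

The thinking phase took 3 min 58 s.

Token generation speed

[Screenshot]

不是。虎鲸是哺乳动物，属于鲸目海豚科。它用肺呼吸、胎生、哺乳、恒温，不具备鱼类用鳃呼吸和卵生等特征。虽生活在海洋且外形似鱼，但分类学上明确属于哺乳类。 (No. The orca is a mammal in the order Cetacea, family Delphinidae. It breathes with lungs, bears live young, nurses them, and is warm-blooded, and it lacks fish traits such as gill breathing and egg laying. Although it lives in the ocean and looks fish-like, taxonomically it is clearly a mammal.)

The token generation speed was 2.64 t/s.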

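Raising the GPU offload

The machine has 96 GB of RAM, 48 GB of which is allocated to the iGPU. That is enough to run a Q4-quantized model in the 30B class: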
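As a rough sanity check on that claim, the sketch below estimates the size of the quantized weights. It assumes exactly 30e9 parameters and a flat 4 bits per weight; real Q4 formats store extra scale data, and the KV cache and runtime buffers need memory too, so the result is a lower bound. `weight_size_gib` is a hypothetical helper, not an LM Studio function.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# ASSUMPTIONS: exactly 30e9 parameters and a flat 4 bits per weight.
# Quantization scales, the KV cache and runtime buffers are ignored.
IGPU_MEMORY_GIB = 48  # iGPU allocation from the post, treated as GiB here

weights = weight_size_gib(30e9, 4)
print(f"weights ~{weights:.1f} GiB, fits in iGPU memory: {weights < IGPU_MEMORY_GIB}")
```

Even with generous headroom for the parts this estimate ignores, the weights use well under a third of the 48 GB allocation.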

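[Screenshot]

Push GPU offload all the way up! Set it to 64! The change only takes effect after a restart.

Chat

[Screenshot]

CPU load has dropped noticeably and GPU load has finally ramped up! Memory usage has gone up as well!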

Think

Here's a thinking process:

Analyze User Input:
Question: "虎鲸是鱼嘛？" (Is an orca a fish?)
Language: Chinese
Core concept: Biological classification of orcas (killer whales)
Identify Key Facts:
Orca = 虎鲸 (hǔ jīng) in Chinese
Scientific name: Orcinus orca
Taxonomy: Animalia > Chordata > Mammalia > Cetacea > Odontoceti > Delphinidae
Key characteristic: It's a marine mammal, not a fish.
Distinguishing features from fish: Breaths air with lungs (not gills), gives birth to live young (viviparous), nurses them with milk, has hair (though minimal), is warm-blooded, etc.
Formulate Response (Internal Refinement - Chinese):
Direct answer: 不是。
Explanation: 虎鲸是哺乳动物，属于鲸目齿鲸亚目海豚科。
Key differences from fish: 用肺呼吸、胎生、哺乳、恒温等。
Keep it concise as requested.
Draft Response (Chinese):
不是。虎鲸是哺乳动物，属于鲸类（海豚科）。它们用肺呼吸、胎生、哺乳、恒温，与鱼类在分类和生理结构上完全不同。

Check against constraints:

Plain, complete, no moralizing/disclaimers
Follow exact format if requested (none here)
Concise by default
Matches language (Chinese)
Accurate scientifically
All good. Ready to output.✅

[Screenshot]

For the same question, the Think time dropped to 1 min 49 s.

Token generation speed

[Screenshot]

The speed clearly reached 3.5 t/s.
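One way to cross-check the two measurements is to estimate how many tokens each thinking phase produced. This is only a sketch: it assumes the speed reported for the answer also held during the think phase, which the screenshots do not confirm. `to_seconds` is a hypothetical helper for the duration format used in this post.

```python
import re


def to_seconds(duration: str) -> int:
    """Parse a duration such as '3min58s' into seconds."""
    match = re.fullmatch(r"(\d+)min(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


# (think time, reported speed in t/s) for each run, taken from this post
runs = {"offload=4": ("3min58s", 2.64), "offload=64": ("1min49s", 3.5)}

# ASSUMPTION: the reported speed also applied during the think phase.
estimates = {name: to_seconds(t) * speed for name, (t, speed) in runs.items()}
for name, tokens in estimates.items():
    print(f"{name}: ~{tokens:.0f} think tokens")
```

If the assumption holds, the first run produced roughly 630 think tokens and the second roughly 380, which matches the visibly shorter second thinking trace. The t/s figure is therefore the cleaner measure of the speedup, since the Think time also depends on how much the model chose to write.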

Comparison

With the same model and the same question, the answers differ very little, while the performance gain is fairly clear:

Metric	Default (GPU offload = 4)	GPU offload = 64	Improvement
Think time	3 min 58 s	1 min 49 s	54.2% shorter
Generation speed	2.64 t/s	3.5 t/s	32.58% faster

That is a sizable improvement.
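The two percentages in the comparison can be reproduced with a few lines of arithmetic. The sketch below treats the Think figure as a reduction in time and the Answer figure as an increase in speed; the helper names are made up for illustration.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which a value shrank (lower is better, e.g. time)."""
    return (before - after) / before * 100


def pct_increase(before: float, after: float) -> float:
    """Percentage by which a value grew (higher is better, e.g. t/s)."""
    return (after - before) / before * 100


think_before = 3 * 60 + 58  # 3min58s in seconds
think_after = 1 * 60 + 49   # 1min49s in seconds

think_gain = pct_reduction(think_before, think_after)
speed_gain = pct_increase(2.64, 3.5)
print(f"Think time reduced by {think_gain:.1f}%")            # 54.2%
print(f"Generation speed increased by {speed_gain:.2f}%")    # 32.58%
```

Note that the two rows use different baselines: a 54.2% shorter time is not the same kind of number as a 32.58% higher rate, so they should not be compared directly.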

Please credit the source when reposting: https://lizhiyong.blog.csdn.net/article/details/161386515
