Which popular large models can the CPU build of vLLM support?

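This post records building and installing vLLM in a CPU environment and debugging Baidu's ERNIE-4.5 (文心) series models. Testing showed that the small ERNIE-4.5-0.3B model runs normally, but debugging the 28B model failed with a series of errors, among them having to add the trust_remote_code argument, running out of memory, and missing AVX instruction sets. After setting dtype=float (to work around the missing instructions) and lowering max_model_len (to reduce KV-cache memory), the test still ended when the virtual machine ran out of memory. The conclusion is that a CPU is only suitable for running small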

天马行空skywalk

Published 2025-11-12 12:13:22


For building and installing the CPU version of vLLM, see: https://skywalk.blog.csdn.net/article/details/154336915

Testing shows that the ERNIE-4.5-0.3B (文心 ernie4.5-0.3b) model works.

I could not get the 28B model to work.

Synchronous startup:

export VLLM_USE_ASYNC_ENGINE=False

Set dtype to float, and set max_model_len to half of the maximum length vLLM estimated (74880 / 2 = 37440):

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust_remote_code --max_model_len 37440 --dtype float
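Why might this setting overwhelm a VM? A rough back-of-the-envelope sketch, assuming the nominal "28B" in the model name is the total parameter count that has to sit in RAM (for an MoE model like this one, "A3B" suggests only about 3B parameters are active per token, but all weights are still loaded) and ignoring activations, KV cache and runtime overhead. In vLLM, `--dtype float` means float32, which stores 4 bytes per parameter instead of the 2 bytes of bfloat16, so the weight footprint roughly doubles.

```python
# Rough weight-memory estimate per dtype (illustrative only).
# Assumption: the nominal parameter count from the model name ("28B")
# is the total number of weights that must be resident in RAM.
# Activations, KV cache and framework overhead are ignored.

BYTES_PER_PARAM = {
    "float32": 4,   # what --dtype float selects
    "bfloat16": 2,
    "float16": 2,
}

GIB = 1024 ** 3


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


if __name__ == "__main__":
    params = 28e9  # nominal size taken from the model name
    for dtype in ("bfloat16", "float32"):
        print(f"{dtype:>9}: ~{weight_memory_gib(params, dtype):.0f} GiB")
```

With these assumptions it prints about 52 GiB for bfloat16 and about 104 GiB for float32, so the dtype workaround trades the instruction-set problem for a much larger memory requirement.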

Unexpectedly, this crashed the SCNet virtual machine outright. My resolve was shattered, so I'm leaving it alone.

The DeepSeek 14B model is still being debugged.

I'll test again once my mood recovers.

Conclusion

Running small models on a CPU is workable, for example ernie4.5-0.3b.

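However, older CPUs can't run them, and newer machines, frankly, usually come with accelerator cards such as GPUs or DCUs, so the CPU never gets the job anyway!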

Debugging

Errors with the 28B model

(APIServer pid=9014) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=9014) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=9014) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=9014) Value error, The repository baidu/ERNIE-4.5-VL-28B-A3B-Thinking contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking .
(APIServer pid=9014) You can inspect the repository content at https://hf.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking.
(APIServer pid=9014) Please pass the argument `trust_remote_code=True` to allow custom code to be run. [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]

Following the hint, add trust_remote_code=True:

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust_remote_code

Starting the model then fails with: Run `pip install decord`

(APIServer pid=9729) ImportError: This modeling file requires the following packages that were not found in your environment: decord. Run `pip install decord`

Install it as instructed (pip install decord).
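Since each missing dependency only shows up after vLLM has started loading the model, a quick pre-flight check can save a few restarts. This is a minimal sketch, not part of vLLM: the helper name `missing_packages` is hypothetical, and the package list ("vllm" plus "decord" from the error above) is just an example to extend. Note that it checks importable module names, which do not always match the pip package name.

```python
# Pre-flight check: report which Python modules cannot be imported,
# so missing dependencies surface before a long model load starts.
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that are not importable."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "decord" comes from the error above; extend the list as needed.
    missing = missing_packages(["vllm", "decord"])
    if missing:
        print("Missing:", ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All required packages found.")
```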

Starting the model then fails with: Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

(EngineCore_DP0 pid=10197) raise ValueError(
(EngineCore_DP0 pid=10197) ValueError: To serve at least one request with the models's max seq len (131072), (7.00 GiB KV cache is needed, which is larger than the available KV cache memory (4.00 GiB). Based on the available memory, the estimated maximum model length is 74880. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

Following the hint, reduce max_model_len to half of the estimated maximum (74880 / 2 = 37440):

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust_remote_code --max_model_len 37440
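The figures in that error are related by simple proportion: KV-cache memory grows roughly linearly with sequence length, so 4 GiB out of the 7 GiB needed for 131072 tokens covers about 4/7 of that length. The sketch below reproduces only this rough estimate; vLLM's own figure (74880) is slightly lower, presumably because the GiB values in the log are rounded and vLLM works with exact byte counts, so treat the result as approximate. The function name `estimate_max_len` is hypothetical. Also, as far as I know, on the CPU backend the KV-cache budget is set by the VLLM_CPU_KVCACHE_SPACE environment variable (in GiB) rather than by gpu_memory_utilization, so raising it is the alternative to shrinking max_model_len if the machine has spare RAM.

```python
# Rough estimate of how many tokens fit into a given KV-cache budget,
# assuming KV-cache memory scales linearly with sequence length.


def estimate_max_len(needed_gib: float, max_seq_len: int, available_gib: float) -> int:
    """Scale max_seq_len by the fraction of the required KV cache that is available."""
    if available_gib >= needed_gib:
        return max_seq_len
    return int(max_seq_len * available_gib / needed_gib)


if __name__ == "__main__":
    # Figures copied from the error message above.
    rough = estimate_max_len(needed_gib=7.0, max_seq_len=131072, available_gib=4.0)
    print("rough estimate:", rough)                 # close to, not equal to, vLLM's 74880
    print("half of vLLM's estimate:", 74880 // 2)   # the value passed as --max_model_len
```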

Next error: Please set dtype to torch.float or set weights_prepack to False.

(EngineCore_DP0 pid=10587) ERROR 11-12 11:19:28 [core.py:855] AssertionError: BF16 weight prepack needs the cpu support avx_ne_convert or avx512bw, avx512vl and avx512dq, but the desired instruction sets are not available. Please set dtype to torch.float or set weights_prepack to False.
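Before picking a workaround, it can help to check whether the CPU advertises the instruction sets this assertion names: either avx_ne_convert, or all of avx512bw, avx512vl and avx512dq. A minimal sketch, assuming a Linux x86 machine where these appear on the `flags` lines of /proc/cpuinfo; the flag spellings are taken from the error message, and whether the kernel reports avx_ne_convert under exactly that name is an assumption. The helper names are hypothetical.

```python
# Check whether a CPU advertises the instruction sets that vLLM's
# BF16 weight-prepack assertion asks for.
# Assumption: flag names as they appear in Linux /proc/cpuinfo on x86.

AVX512_SET = {"avx512bw", "avx512vl", "avx512dq"}


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the flags listed on the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def bf16_prepack_supported(flags: set) -> bool:
    """True if avx_ne_convert is present, or all of avx512bw/vl/dq are."""
    return "avx_ne_convert" in flags or AVX512_SET <= flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except FileNotFoundError:
        print("/proc/cpuinfo not found (not Linux?)")
    else:
        print("BF16 prepack supported:", bf16_prepack_supported(flags))
```

If this reports False, the error above is expected and falling back to float32 is the remaining option.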

Try disabling weight prepacking:

export VLLM_WEIGHTS_PREPACK=False

That didn't help.

Next, try setting dtype to float:

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust_remote_code --max_model_len 37440 --dtype float

This works, but a new problem appears:

Error: (APIServer pid=11605) with launch_core_engines(vllm_config, executor_class, log_stats) as (

(APIServer pid=11605) Traceback (most recent call last):
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/bin/vllm", line 33, in <module>
(APIServer pid=11605) sys.exit(load_entry_point('vllm==0.11.1rc7.dev21+gd381eb967.cpu', 'console_scripts', 'vllm')())
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=11605) args.dispatch_function(args)
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 59, in cmd
(APIServer pid=11605) uvloop.run(run_server(args))
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=11605) return __asyncio.run(
(APIServer pid=11605) ^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=11605) return runner.run(main)
(APIServer pid=11605) ^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=11605) return self._loop.run_until_complete(task)
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=11605) return await main
(APIServer pid=11605) ^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1944, in run_server
(APIServer pid=11605) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1963, in run_server_worker
(APIServer pid=11605) async with build_async_engine_client(
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=11605) return await anext(self.gen)
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client
(APIServer pid=11605) async with build_async_engine_client_from_engine_args(
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=11605) return await anext(self.gen)
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args
(APIServer pid=11605) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/utils/func_utils.py", line 116, in inner
(APIServer pid=11605) return fn(*args, **kwargs)
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 202, in from_vllm_config
(APIServer pid=11605) return cls(
(APIServer pid=11605) ^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=11605) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=11605) return AsyncMPClient(*client_args)
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 808, in __init__
(APIServer pid=11605) super().__init__(
(APIServer pid=11605) File "/root/.pyenv/versions/3.12.12/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 469, in __init__
(APIServer pid=11605) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=11605) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1160