LLM | Installing and Using Xinference (CPU, Metal, and CUDA Inference, plus Distributed Deployment)

Multi-GPU model-parallel inference is also supported.

ulimpid · Published 2024-09-30 23:58:17

1. Detailed Steps

1.1 Installation

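# CUDA/CPU: install only the engines you need
pip install "xinference[transformers]"
pip install "xinference[vllm]"
pip install "xinference[sglang]"

# Metal (MPS)
pip install "xinference[mlx]"
# Newer llama-cpp-python releases may expect -DGGML_METAL=on instead of -DLLAMA_METAL=on
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python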

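Note: llama-cpp-python did not work on CUDA in my setup, possibly because of the nvcc version or other details of my local environment (it works fine in a plain C/C++ environment). On Metal, llama-cpp-python works normally. To install extra dependencies such as flashinfer, see the official installation guide: https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html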
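To see which engines actually ended up importable after the pip commands above, a quick check along these lines may help. It is only a sketch: the helper name check_engine is made up for illustration, and the module names (in particular mlx_lm and llama_cpp) are assumptions about what each package exposes to Python.

```shell
# Report whether a Python module can be imported in the current environment.
# check_engine is a hypothetical helper, not part of Xinference.
check_engine() {
    # $1 = Python module name; prints "<module>: ok" or "<module>: missing"
    if python3 -c "import $1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: missing"
    fi
}

# Module names are assumptions; adjust them to the engines you installed.
for module in xinference transformers vllm sglang mlx_lm llama_cpp; do
    check_engine "$module"
done
```

A "missing" line only means the import failed in this interpreter; it does not say why, so rerun the import by hand to see the actual error.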

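1.2 Launch

1.2.1 Direct Launch

Minimal command

xinference-local --host 0.0.0.0 --port 9997

Command with more options

Set the model cache path (XINFERENCE_HOME) and the model source (XINFERENCE_MODEL_SRC: Hugging Face or ModelScope)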

# CUDA/CPU
XINFERENCE_HOME=/path/.xinference XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

# Metal(MPS)
XINFERENCE_HOME=/path/.xinference XINFERENCE_MODEL_SRC=modelscope PYTORCH_ENABLE_MPS_FALLBACK=1 xinference-local --host 0.0.0.0 --port 9997
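The two variants above differ only in the PYTORCH_ENABLE_MPS_FALLBACK variable, which tells PyTorch to fall back to the CPU for operations the MPS backend does not implement. A small wrapper can pick the right variant from the platform name. This is a sketch under assumptions: build_launch_cmd is a hypothetical helper, and the default values simply repeat the ones used in this post. It prints the command rather than running it.

```shell
# Print the launch command for a given platform without starting anything.
# build_launch_cmd is a hypothetical helper written for this example.
build_launch_cmd() {
    # $1 = platform name as reported by `uname -s` (Darwin on macOS)
    home="${XINFERENCE_HOME:-/path/.xinference}"
    src="${XINFERENCE_MODEL_SRC:-modelscope}"
    env_prefix="XINFERENCE_HOME=$home XINFERENCE_MODEL_SRC=$src"
    if [ "$1" = "Darwin" ]; then
        # Metal (MPS): allow CPU fallback for unsupported operations
        env_prefix="$env_prefix PYTORCH_ENABLE_MPS_FALLBACK=1"
    fi
    echo "$env_prefix xinference-local --host 0.0.0.0 --port 9997"
}

# Show the command that matches the machine this script runs on.
build_launch_cmd "$(uname -s)"
```

Printing first makes it easy to review the environment variables before pasting the line into a terminal.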

1.2.2 Cluster Deployment

Use ifconfig to find the current server's IP address.

1.2.2.1 Start the Supervisor on the Main Server

# Format
xinference-supervisor -H <main-server-IP> --port 9997

# Example
xinference-supervisor -H 192.168.31.100 --port 9997

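1.2.2.2 Start a Worker on Each Other Server

# Format
xinference-worker -e "http://<main-server-IP>:9997" -H <worker-server-IP>

# Example
xinference-worker -e "http://192.168.31.100:9997" -H 192.168.31.101

Note: add environment variables such as XINFERENCE_HOME, XINFERENCE_MODEL_SRC, and PYTORCH_ENABLE_MPS_FALLBACK as needed; they are set at launch time, as prefixes to the command.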
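With more than one worker it can be convenient to generate the per-host commands from a list of IPs. The sketch below only prints the commands; it does not start anything. The helper names and the second worker IP (192.168.31.102) are hypothetical, and XINFERENCE_MODEL_SRC=modelscope is included just to show where an environment variable would go.

```shell
SUPERVISOR_IP="192.168.31.100"
PORT=9997
# The second worker IP is a made-up example value.
WORKER_IPS="192.168.31.101 192.168.31.102"

# Hypothetical helpers that format the commands shown in this section.
supervisor_cmd() {
    # $1 = supervisor IP, $2 = port
    echo "xinference-supervisor -H $1 --port $2"
}

worker_cmd() {
    # $1 = supervisor IP, $2 = supervisor port, $3 = this worker's IP
    echo "XINFERENCE_MODEL_SRC=modelscope xinference-worker -e \"http://$1:$2\" -H $3"
}

echo "# run on $SUPERVISOR_IP"
supervisor_cmd "$SUPERVISOR_IP" "$PORT"
for ip in $WORKER_IPS; do
    echo "# run on $ip"
    worker_cmd "$SUPERVISOR_IP" "$PORT" "$ip"
done
```

Start the supervisor first, then run each printed worker command on its own machine.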

1.3 Usage

Open http://<main-server-IP>:9997/docs to browse the API documentation, and http://<main-server-IP>:9997 to use the service.
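Besides the browser, the same endpoint can be queried from the command line. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/models route for listing running models; api_url is a hypothetical helper and BASE_URL should be changed to your main server's address.

```shell
# Address of the supervisor (or of xinference-local); adjust to your host.
BASE_URL="http://127.0.0.1:9997"

# Hypothetical helper: join BASE_URL and an API path.
api_url() {
    # $1 = API path such as /v1/models; a trailing slash on BASE_URL is dropped
    echo "${BASE_URL%/}$1"
}

# Ask the server which models are currently running.
# The /v1/models route is assumed; confirm it on the /docs page.
curl --silent --show-error --max-time 5 "$(api_url /v1/models)" \
    || echo "Xinference is not reachable at $BASE_URL"
```

If the request fails, check that the host and port match the values passed to xinference-local or xinference-supervisor.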

2. References

2.1 Xinference

2.1.1 Deployment Documentation

Run Xinference locally

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#run-xinference-locally

Deploy Xinference in a cluster

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#deploy-xinference-in-a-cluster

2.1.2 Installation Documentation

Official page

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html

Transformers engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#transformers-backend

vLLM engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#vllm-backend

Llama.cpp engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#llama-cpp-backend

MLX engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#mlx-backend

3. Resources

3.1 Xinference

3.1.1 GitHub

Official page

https://github.com/xorbitsai/inference

https://github.com/xorbitsai/inference/blob/main/README_zh_CN.md

3.1.2 Installation Documentation

SGLang engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#sglang-backend

Other platforms (installing on Ascend NPU)

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#other-platforms

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation_npu.html#installation-npu
