vllm 本地部署大模型
本文介绍了Qwen3-8B大模型的部署流程。首先通过pip安装torch、transformers和vllm等必要依赖,然后使用modelscope下载模型文件。接着详细说明了单卡A800服务器上的vLLM服务启动命令,包括指定模型路径、服务名称、端口号等参数配置。启动日志显示系统自动检测CUDA平台,并成功初始化vLLM引擎。整个过程涵盖了从环境配置到服务部署的关键步骤,为Qwen3-8B模型的
·
⭕️第一步:配置服务器环境
pip install torch==2.6.0
pip install transformers==4.51.3
pip install vllm==0.8.5
下载模型—download.py
from modelscope import snapshot_download
model_dir=snapshot_download('Qwen/Qwen3-8B',cache_dir='./Qwen3-8B_model')
print('model_dir: ', model_dir)
⭕️第二步:启动Vllm服务
单卡部署—A800
python3 -m vllm.entrypoints.openai.api_server \
--model './Qwen3-8B' \
--served-model-name 'qwen3-8b' \
--host 0.0.0.0 \
--port 6006 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--dtype=half
启动信息如下👇:
python3 -m vllm.entrypoints.openai.api_server \
--model './Qwen3-8B' \
--served-model-name 'qwen3-8b' \
--host 0.0.0.0 \
--port 6006 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--dtype=half
INFO 05-17 13:48:12 [__init__.py:239] Automatically detected platform cuda.
INFO 05-17 13:48:16 [api_server.py:1043] vLLM API server version 0.8.5
INFO 05-17 13:48:16 [api_server.py:1044] args: Namespace(host='0.0.0.0', port=6006, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='./Qwen3-8B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['qwen3-8b'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
WARNING 05-17 13:48:16 [config.py:2972] Casting torch.bfloat16 to torch.float16.
INFO 05-17 13:48:23 [config.py:717] This model supports multiple tasks: {'generate', 'score', 'embed', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 05-17 13:48:23 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-17 13:48:28 [__init__.py:239] Automatically detected platform cuda.
INFO 05-17 13:48:31 [core.py:58] Initializing a V1 LLM engine (v0.8.5) with config: model='./Qwen3-8B', speculative_config=None, tokenizer='./Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=qwen3-8b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 05-17 13:48:51 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f58945252d0>
INFO 05-17 13:48:52 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-17 13:48:52 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 05-17 13:48:52 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
INFO 05-17 13:48:52 [gpu_model_runner.py:1329] Starting to load model ./Qwen3-8B...
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:23<01:35, 23.96s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:46<01:08, 22.89s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [01:06<00:43, 21.89s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [01:23<00:19, 19.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:28<00:00, 14.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:28<00:00, 17.65s/it]
INFO 05-17 13:50:21 [loader.py:458] Loading weights took 88.30 seconds
INFO 05-17 13:50:21 [gpu_model_runner.py:1347] Model loading took 15.2683 GiB and 88.680642 seconds
INFO 05-17 13:50:32 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/71e6c18f16/rank_0_0 for vLLM's torch.compile
INFO 05-17 13:50:32 [backends.py:430] Dynamo bytecode transform time: 10.94 s
INFO 05-17 13:50:37 [backends.py:136] Cache the graph of shape None for later use
INFO 05-17 13:51:18 [backends.py:148] Compiling a graph for general shape takes 45.33 s
INFO 05-17 13:51:53 [monitor.py:33] torch.compile takes 56.27 s in total
INFO 05-17 13:51:54 [kv_cache_utils.py:634] GPU KV cache size: 363,632 tokens
INFO 05-17 13:51:54 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 8.88x
INFO 05-17 13:52:35 [gpu_model_runner.py:1686] Graph capturing finished in 41 secs, took 0.54 GiB
INFO 05-17 13:52:36 [core.py:159] init engine (profile, create kv cache, warmup model) took 134.64 seconds
INFO 05-17 13:52:36 [core_client.py:439] Core engine process 0 ready.
WARNING 05-17 13:52:36 [config.py:1239] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-17 13:52:36 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-17 13:52:36 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-17 13:52:36 [api_server.py:1090] Starting vLLM API server on http://0.0.0.0:6006
INFO 05-17 13:52:36 [launcher.py:28] Available routes are:
INFO 05-17 13:52:36 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /health, Methods: GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /load, Methods: GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /ping, Methods: GET, POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /version, Methods: GET
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /score, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-17 13:52:36 [launcher.py:36] Route: /metrics, Methods: GET
INFO: Started server process [10879]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Started server process [10879]
INFO: Waiting for application startup.
INFO: Application startup complete.
这三行信息表示启动成功✅
查看GPU利用情况
watch -n 1 nvidia-smi
服务启动之后显存占用情况:
Sun May 17 13:53:41 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06 Driver Version: 580.65.06 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A800-SXM4-80GB On | 00000000:23:00.0 Off | 0 |
| N/A 29C P0 83W / 400W | 73771MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
A800显卡几乎全部占用
⭕️第三步:编写测试代码—test.py
import requests
import time
url='http://0.0.0.0:6006/v1/completions'
headers={"Content-Type":"application/json"}
def query_vllm(prompt,model_name='qwen3-8b',max_tokens=2000,temperature=0.7):
data={
'model':model_name,
'prompt':prompt,
'max_tokens':max_tokens,
'temperature':temperature
}
try:
response = requests.post(url, headers=headers, json=data, timeout=60)
response.raise_for_status()
return response.json()["choices"][0]["text"]
except requests.exceptions.RequestException as e:
return f"请求出错: {str(e)}"
#使用实例
start_time=time.time()
result=query_vllm('解释一下解析几何的基本概念')
end_time=time.time()
print(result)
print('cost_time:',end_time-start_time)
⭕️第四步:新建终端,运行测试代码
python test.py
运行结果:
➜ /workspace git:(master) ✗ python vllm_test.py
,包括坐标系、点、直线、圆、椭圆、抛物线、双曲线等。 解析几何,又称坐标几何,是用代数的方法来研究几何图形的学科。它通过将几何问题转化为代数方程,从而利用代数工具解决几何问题。解析几何的基本概念包括坐标系、点、直线、圆、椭圆、抛物线、双曲线等。以下是对这些基本概念的解释:
### 1. 坐标系
坐标系是解析几何的基础,用于确定平面上或空间中点的位置。最常见的坐标系是**笛卡尔坐标系**(Cartesian Coordinate System),它由两条互相垂直的数轴组成,通常称为x轴和y轴,它们的交点称为原点(O)。在三维空间中,还会有一个z轴。
- **二维坐标系**:平面上的点用有序对(x, y)表示。
- **三维坐标系**:空间中的点用有序三元组(x, y, z)表示。
### 2. 点
点是几何中的基本元素,表示空间中的一个位置。在解析几何中,点由坐标系中的坐标确定。例如,在二维坐标系中,点P的坐标为(x, y),其中x和y是该点在x轴和y轴上的投影。
### 3. 直线
直线是两点之间最短路径的集合。在解析几何中,直线可以用方程来表示,常见的直线方程形式包括:
- **斜截式**:y = kx + b,其中k是斜率,b是y轴截距。
- **两点式**:已知两点(x₁, y₁)和(x₂, y₂),直线方程为(y - y₁) = [(y₂ - y₁)/(x₂ - x₁)](x - x₁)。
- **一般式**:Ax + By + C = 0,其中A、B、C为常数。
### 4. 圆
圆是平面上到定点(圆心)的距离等于定长(半径)的所有点的集合。圆的标准方程为:
- **标准式**:(x - a)² + (y - b)² = r²,其中(a, b)是圆心,r是半径。
- **一般式**:x² + y² + Dx + Ey + F = 0,其中D、E、F为常数。
### 5. 椭圆
椭圆是平面上到两个定点(焦点)的距离之和为常数的点的集合。椭圆的标准方程为:
- **标准式**:(x²/a²) + (y²/b²) = 1,其中a和b是长半轴和短半轴的长度,焦点位于长轴上。
- **参数方程**:x = a cosθ,y = b sinθ,θ为参数。
### 6. 抛物线
抛物线是平面上到定点(焦点)和定直线(准线)的距离相等的点的集合。抛物线的标准方程包括:
- **开口向右**:y² = 4ax,其中a是焦点到顶点的距离。
- **开口向上**:x² = 4ay,其中a是焦点到顶点的距离。
- **一般式**:y = ax² + bx + c 或 x = ay² + by + c。
### 7. 双曲线
双曲线是平面上到两个定点(焦点)的距离之差为常数的点的集合。双曲线的标准方程为:
- **标准式**:(x²/a²) - (y²/b²) = 1(水平双曲线)或(y²/a²) - (x²/b²) = 1(垂直双曲线)。
- **参数方程**:x = a secθ,y = b tanθ(水平双曲线)或x = a tanθ,y = b secθ(垂直双曲线)。
### 8. 其他概念
- **斜率**:直线的斜率表示其倾斜程度,计算公式为k = (y₂ - y₁)/(x₂ - x₁)。
- **距离公式**:两点P₁(x₁, y₁)和P₂(x₂, y₂)之间的距离为√[(x₂ - x₁)² + (y₂ - y₁)²]。
- **中点公式**:两点P₁(x₁, y₁)和P₂(x₂, y₂)的中点坐标为((x₁ + x₂)/2, (y₁ + y₂)/2)。
### 9. 三维解析几何
在三维空间中,除了x轴和y轴,还有z轴。点的坐标为(x, y, z),直线和曲面可以用类似的方法表示,例如:
- **平面方程**:Ax + By + Cz + D = 0。
- **球面方程**:(x - a)² + (y - b)² + (z - c)² = r²。
### 总结
解析几何通过将几何图形与代数方程联系起来,使得几何问题可以通过代数运算来解决。掌握这些基本概念和方程是学习更高级解析几何和应用数学的基础。例如,在物理学、工程学和计算机图形学中,解析几何被广泛应用于建模和问题求解。通过理解这些概念,可以更深入地分析和解决各种几何问题。😊
在解析几何中,点、直线、圆、椭圆、抛物线和双曲线等基本概念构成了几何图形的基础,通过代数方程的形式描述它们的位置和性质。这些概念不仅在数学理论中占有重要地位,还在实际应用中发挥着关键作用。例如,圆和椭圆在天文学中用于描述行星轨道,抛物线和双曲线在工程学中用于设计抛物面镜和卫星轨道。通过解析几何,我们能够将复杂的几何问题转化为代数问题,从而利用代数工具进行分析和求解。😊
**扩展知识**:
- **参数方程与极坐标**:除了笛卡尔坐标系,解析几何还涉及极坐标系,其中点的位置由极径(距离原点的距离)和极角(与极轴的夹角)确定。
- **向量与解析几何**:向量在解析几何中用于表示方向和大小,通过向量运算可以更简便地处理几何问题。
- **微积分与解析几何**:解析几何为微积分提供了几何基础,例如导数可以用来研究曲线的斜率,积分可以用来计算曲线围成的面积。
通过深入学习这些概念,可以进一步探索更复杂的几何图形和应用,如三维几何、参数曲面、曲线拟合等。解析几何不仅是一门理论学科,更是连接数学与现实世界的桥梁。😊
在解析几何中,点、直线、圆、椭圆、抛物线和双曲线等基本概念构成了几何图形的基础,通过代数方程的形式描述它们的位置和性质。这些概念不仅在数学理论中占有重要地位,还在实际应用中发挥着关键作用。例如,圆和椭圆在天文学中用于描述行星轨道,抛物线和双曲线在工程学中用于设计抛物面镜和卫星轨道。通过解析几何,我们能够将复杂的几何问题转化为代数问题,从而利用代数工具进行分析和求解。😊
**扩展知识**:
- **参数方程与极坐标**:除了笛卡尔坐标系,解析几何还涉及极坐标系,其中点的位置由极径(距离原点的距离)和极角(与极轴的夹角)确定。
- **向量与解析几何**:向量在解析几何中用于表示方向和大小,通过向量运算可以更简便地处理几何问题。
- **微积分与解析几何**:解析几何为微积分提供了几何基础,例如导数可以用来研究曲线的斜率,积分可以用来计算曲线围成的面积。
通过深入学习这些概念,可以进一步探索更复杂的几何图形和应用,如三维几何、参数曲面、曲线拟合等。解析几何不仅是一门理论学科,更是连接数学与现实世界的桥梁。😊
解析几何是数学中一个重要的分支,它通过代数方法研究几何图形的性质和关系。以下是对解析几何基本概念的详细解释,涵盖坐标系、点、直线、圆、椭圆、抛物线和双曲线等:
---
### **1. 坐标系**
- **笛卡尔坐标系**:由两条互相垂直的数轴(x轴和y轴)组成,原点为交点。平面上的点用有序对(x, y)表示;空间中增加z轴,点用(x, y, z)表示。
- **极坐标系**:用极径(r)和极角(θ)表示点的位置,其中r是点到原点的距离,θ是点与极轴的夹角。转换公式:x = r cosθ,y = r sinθ。
---
### **2. 点**
- **定义**:几何中的基本元素,表示空间中的
cost_time: 25.24479031562805
更多推荐

所有评论(0)