Deploying Qwen-VL with vLLM
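This article walks through deploying the Qwen2-VL-2B-Instruct model with the vLLM framework. The service is started with the vllm serve command; the startup log reports vLLM version 0.19.0, the torch.bfloat16 data type, and a maximum sequence length of 32768. The log also contains the model initialization configuration, a notice that asynchronous scheduling is enabled, and a warning that the image processor is now loaded as a fast processor by default. The engine initialization parameters are printed as well, including the parallelism settings, quantization option, and dynamic-shape configuration.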
First, set up the environment (omitted here) and download the Qwen2-VL-2B-Instruct model from Hugging Face.
Open a terminal and run:
vllm serve ../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct
Output:
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:299]
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:299] █▄█▀ █ █ █ █ model ../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:299]
(APIServer pid=65324) INFO 05-12 08:32:04 [utils.py:233] non-default args: {'model_tag': '../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct', 'model': '../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct'}
(APIServer pid=65324) INFO 05-12 08:32:04 [model.py:549] Resolved architecture: Qwen2VLForConditionalGeneration
(APIServer pid=65324) INFO 05-12 08:32:04 [model.py:1678] Using max model len 32768
(APIServer pid=65324) INFO 05-12 08:32:04 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=65324) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore pid=65997) INFO 05-12 08:32:18 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct', speculative_config=None, tokenizer='../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=65997) WARNING 05-12 08:32:20 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore pid=65997) INFO 05-12 08:32:21 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.65.3:56879 backend=nccl
(EngineCore pid=65997) INFO 05-12 08:32:21 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=65997) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore pid=65997) INFO 05-12 08:32:25 [gpu_model_runner.py:4735] Starting to load model ../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct...
(EngineCore pid=65997) INFO 05-12 08:32:25 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=65997) INFO 05-12 08:32:25 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=65997) INFO 05-12 08:32:25 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=65997) INFO 05-12 08:32:28 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=65997) INFO 05-12 08:32:28 [flash_attn.py:596] Using FlashAttention version 2
(EngineCore pid=65997) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=65997) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:31<00:00, 31.11s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:31<00:00, 31.11s/it]
(EngineCore pid=65997)
(EngineCore pid=65997) INFO 05-12 08:32:59 [default_loader.py:384] Loading weights took 31.31 seconds
(EngineCore pid=65997) INFO 05-12 08:33:00 [gpu_model_runner.py:4820] Model loading took 4.15 GiB memory and 34.534105 seconds
(EngineCore pid=65997) INFO 05-12 08:33:00 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=65997) INFO 05-12 08:33:45 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/d39acccb88/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=65997) INFO 05-12 08:33:45 [backends.py:1111] Dynamo bytecode transform time: 1.34 s
(EngineCore pid=65997) INFO 05-12 08:33:47 [backends.py:285] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.033 s
(EngineCore pid=65997) INFO 05-12 08:33:47 [decorators.py:303] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/0daf28c8dec94adb05b9854b3246da4bd5298b7409e6451e0b86fe06b241845a/rank_0_0/model
(EngineCore pid=65997) INFO 05-12 08:33:47 [monitor.py:48] torch.compile took 2.70 s in total
(EngineCore pid=65997) INFO 05-12 08:33:47 [monitor.py:76] Initial profiling/warmup run took 0.01 s
(EngineCore pid=65997) INFO 05-12 08:33:47 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=65997) INFO 05-12 08:33:47 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=65997) INFO 05-12 08:34:29 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 0.32 GiB total
(EngineCore pid=65997) INFO 05-12 08:34:29 [gpu_worker.py:436] Available KV cache memory: 1.52 GiB
(EngineCore pid=65997) INFO 05-12 08:34:29 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9272 to maintain the same effective KV cache size.
(EngineCore pid=65997) INFO 05-12 08:34:29 [kv_cache_utils.py:1319] GPU KV cache size: 56,960 tokens
(EngineCore pid=65997) INFO 05-12 08:34:29 [kv_cache_utils.py:1324] Maximum concurrency for 32,768 tokens per request: 1.74x
(EngineCore pid=65997) 2026-05-12 08:34:29,720 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=65997) 2026-05-12 08:34:29,733 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█| 51/51 [00:02<00:00,
Capturing CUDA graphs (decode, FULL): 100%|█████████| 35/35 [00:01<00:00, 34.16it/s]
(EngineCore pid=65997) INFO 05-12 08:34:33 [gpu_model_runner.py:6046] Graph capturing finished in 3 secs, took 0.27 GiB
(EngineCore pid=65997) INFO 05-12 08:34:33 [gpu_worker.py:597] CUDA graph pool memory: 0.27 GiB (actual), 0.32 GiB (estimated), difference: 0.06 GiB (21.2%).
(EngineCore pid=65997) INFO 05-12 08:34:33 [core.py:283] init engine (profile, create kv cache, warmup model) took 92.98 seconds
(APIServer pid=65324) WARNING 05-12 08:34:33 [interface.py:525] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(APIServer pid=65324) INFO 05-12 08:34:33 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=65324) WARNING 05-12 08:34:34 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.01, 'top_k': 1, 'top_p': 0.001}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=65324) INFO 05-12 08:34:35 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=65324) INFO 05-12 08:34:37 [base.py:231] Multi-modal warmup completed in 2.185s
(APIServer pid=65324) INFO 05-12 08:34:37 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:37] Available routes are:
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=65324) INFO 05-12 08:34:37 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=65324) INFO: Started server process [65324]
(APIServer pid=65324) INFO: Waiting for application startup.
(APIServer pid=65324) INFO: Application startup complete.
When the final line, "Application startup complete.", appears, the model has finished loading and the server is listening on port 8000.
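Instead of watching the terminal, readiness can also be checked programmatically. The following is a minimal sketch, assuming the server runs at http://localhost:8000 as in the log above; it uses the /health and /v1/models routes listed in the startup output. The function names (wait_until_ready, extract_model_ids, list_served_models) are hypothetical helpers, not part of vLLM, and the /v1/models response is assumed to follow the OpenAI list format with a "data" array of objects carrying an "id".

```python
import json
import time
import urllib.error
import urllib.request

# Address taken from the startup log; adjust if you changed host or port.
BASE_URL = "http://localhost:8000"


def wait_until_ready(base_url=BASE_URL, timeout_s=300.0, interval_s=2.0):
    """Poll GET /health until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


def extract_model_ids(models_response):
    """Pull the model ids out of a parsed /v1/models response (assumed OpenAI list format)."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_served_models(base_url=BASE_URL):
    """Return the model names the server accepts in the "model" request field."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return extract_model_ids(json.load(resp))
```

A typical use is to call wait_until_ready() after launching vllm serve and then print list_served_models() to see which name to send in the "model" field of a request.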
Test code:
import base64

import requests


def encode_image(image_path):
    """Read a local image and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# 1. Endpoint of the OpenAI-compatible chat API
url = "http://localhost:8000/v1/chat/completions"

# 2. Request body
image_path = "bus.jpg"
base64_image = encode_image(image_path)  # encode the local image
data = {
    # Must match served_model_name in the startup log; without
    # --served-model-name this is the path passed to `vllm serve`.
    "model": "../TensorRT-LLM-1.2.0/Qwen2-VL-2B-Instruct",
    "messages": [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                {"type": "text", "text": "图片中有什么?"},  # "What is in the picture?"
            ],
        },
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 1024,
}

# 3. Send the POST request; json= serializes the dict and sets Content-Type
response = requests.post(url, json=data)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of on a missing key

# 4. Print the generated reply
print(response.json()["choices"][0]["message"]["content"])
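The test script hard-codes the image/jpeg MIME type in the data URL, which is only right for JPEG files. As a minimal sketch, the same user message can be built for other local image types by guessing the MIME type from the file extension with the standard-library mimetypes module. The helper names image_to_data_url and build_user_message are hypothetical and introduced here only for illustration.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Encode a local image as a data URL, guessing the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(image_path, prompt):
    """Build one user message holding an image part followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

The returned dictionary can be placed directly in the "messages" list of the request body in place of the hand-written user entry.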
Test image bus.jpg:
Output from running the demo above:
(APIServer pid=65324) INFO: 127.0.0.1:40600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=65324) INFO 05-12 08:36:07 [loggers.py:259] Engine 000: Avg prompt throughput: 116.6 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=65324) INFO 05-12 08:36:17 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
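Model reply (translated from Chinese): The picture shows a blue electric bus with a "cero emisiones" (zero emissions) sign on its body. The bus is parked on the street, with some pedestrians around it. Some buildings and trees can be seen in the background.
GPU memory usage is about 7 GB.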
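The memory figure can be cross-checked against the startup log. The sketch below uses only values printed in that log (4.15 GiB for the weights, 1.52 GiB of KV cache, 0.27 GiB for the CUDA graph pool, a 56,960-token KV cache, and a 32,768-token maximum length). Attributing the remaining memory to the CUDA context, activations, and the encoder cache is an assumption; the log does not break that part down.

```python
# Values copied from the vLLM startup log above.
WEIGHTS_GIB = 4.15        # "Model loading took 4.15 GiB memory"
KV_CACHE_GIB = 1.52       # "Available KV cache memory: 1.52 GiB"
CUDA_GRAPH_GIB = 0.27     # "Graph capturing finished in 3 secs, took 0.27 GiB"
KV_CACHE_TOKENS = 56_960  # "GPU KV cache size: 56,960 tokens"
MAX_MODEL_LEN = 32_768    # "Using max model len 32768"


def accounted_memory_gib(weights, kv_cache, cuda_graphs):
    """Sum the memory components that the log reports explicitly."""
    return round(weights + kv_cache + cuda_graphs, 2)


def max_concurrency(kv_cache_tokens, max_model_len):
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


def kv_kib_per_token(kv_cache_gib, kv_cache_tokens):
    """Approximate KV cache cost of one token, in KiB."""
    return kv_cache_gib * 1024 * 1024 / kv_cache_tokens


total = accounted_memory_gib(WEIGHTS_GIB, KV_CACHE_GIB, CUDA_GRAPH_GIB)
print(f"explicitly logged memory: {total} GiB")
print(f"max concurrency: {max_concurrency(KV_CACHE_TOKENS, MAX_MODEL_LEN):.2f}x")
print(f"KV cache per token: {kv_kib_per_token(KV_CACHE_GIB, KV_CACHE_TOKENS):.1f} KiB")
```

The three logged components add up to 5.94 GiB, and the concurrency ratio reproduces the 1.74x printed by vLLM. The gap between 5.94 GiB and the observed figure of about 7 GB would then be runtime overhead outside these three buckets.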