[LLM Inference] SGLang Memory Calculation

执笔论英雄


执笔论英雄 · Published 2025-11-28 15:27:51

SGLang recommends leaving 5-8 GB of GPU memory available. This range leaves enough memory for dynamic allocation while maximizing KV cache capacity. [1]

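Why 5-8 GB

How memory is allocated

SGLang's GPU memory usage follows this formula: [2]

Total memory = model weights + KV cache pool + CUDA graph buffers + activations

The first two terms are covered by --mem-fraction-static. The activations and CUDA graph buffers are not, so memory must be left free for them; as a rule of thumb, 5-8 GB is typically sufficient.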
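As a rough illustration of the formula above, the sketch below splits a GPU's capacity into the static part governed by --mem-fraction-static (model weights plus KV cache pool, as the tuning guide cited under Citations defines it) and the remainder left for CUDA graph buffers and activations. The function name and the example numbers (an 80 GB GPU, 16 GB of weights, a fraction of 0.88) are illustrative assumptions, not measurements.

```python
def split_gpu_memory(capacity_gb: float, mem_fraction_static: float, weights_gb: float) -> dict:
    """Split GPU capacity per: total = weights + KV pool + CUDA graphs + activations."""
    # mem_fraction_static = (model weights + KV cache pool) / GPU memory capacity
    static_gb = capacity_gb * mem_fraction_static
    kv_pool_gb = static_gb - weights_gb
    if kv_pool_gb <= 0:
        raise ValueError("static budget is too small to hold the model weights")
    return {
        "static_gb": static_gb,                   # weights + KV cache pool
        "kv_pool_gb": kv_pool_gb,                 # what remains for the KV cache
        "dynamic_gb": capacity_gb - static_gb,    # CUDA graph buffers + activations
    }


# Hypothetical example values, chosen only to make the arithmetic easy to follow.
budget = split_gpu_memory(capacity_gb=80.0, mem_fraction_static=0.88, weights_gb=16.0)
for name, value in budget.items():
    print(f"{name}: {value:.1f} GB")
```

With these example inputs the dynamic remainder comes out slightly above the 5-8 GB rule of thumb, which is the situation where the fraction could be raised a little.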

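Calculation logic

When --mem-fraction-static is not set, server_args.py estimates the memory to reserve as follows (simplified; the fuller excerpt is under Citations): [3]

# Base reservation for constant metadata
# (values appear to be in MB, judging by the 10 * 1024 floor in the full excerpt)
reserved_mem = 512

# Activation memory (scales with chunked_prefill_size;
# max_prefill_tokens is used instead when chunked prefill is disabled)
reserved_mem += max(self.chunked_prefill_size, 2048) * 1.5

# CUDA graph memory
reserved_mem += self.cuda_graph_max_bs * 2

# Adjustment for large parallel sizes
reserved_mem += self.tp_size * self.pp_size / 8 * 1024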
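To make the heuristic easier to experiment with, here is a minimal standalone sketch that restates the base terms from the server_args.py excerpt cited under Citations as a plain function. It deliberately omits the DP attention, speculative decoding, and piecewise CUDA graph adjustments. The function name, the assumption that all values are in MB, and the example inputs (including cuda_graph_max_bs=160) are mine, not SGLang's.

```python
from typing import Optional


def estimate_mem_fraction_static(
    gpu_mem_mb: Optional[float],
    chunked_prefill_size: int,
    max_prefill_tokens: int,
    cuda_graph_max_bs: int,
    tp_size: int = 1,
    pp_size: int = 1,
) -> float:
    """Restate the base reservation heuristic (units assumed to be MB)."""
    reserved_mem = 512.0  # constant metadata
    # Activations during large prefill: fall back to max_prefill_tokens
    # when chunked prefill is disabled (chunked_prefill_size <= 0).
    prefill_tokens = chunked_prefill_size if chunked_prefill_size > 0 else max_prefill_tokens
    reserved_mem += max(prefill_tokens, 2048) * 1.5
    reserved_mem += cuda_graph_max_bs * 2          # CUDA graphs
    reserved_mem += tp_size * pp_size / 8 * 1024   # large parallel sizes
    if gpu_mem_mb is None:
        return 0.88                                # default when GPU memory is unknown
    if gpu_mem_mb > 60 * 1024:
        reserved_mem = max(reserved_mem, 10 * 1024)  # floor on GPUs above 60 GB
    return round((gpu_mem_mb - reserved_mem) / gpu_mem_mb, 3)


# Hypothetical 80 GB GPU: 512 + 8192 * 1.5 + 160 * 2 + 128 = 13248 MB reserved.
print(estimate_mem_fraction_static(80 * 1024, 8192, 16384, 160))
```

Because the reservation is an absolute amount, the resulting fraction shifts with GPU capacity even when the workload settings stay the same.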

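What the different ranges mean

5-8 GB (good): enough is reserved for activations and CUDA graph buffers, and the server runs stably. [4]
10-20 GB (too high): increase --mem-fraction-static to give more memory to the KV cache.
Below 5 GB (too low): risks OOM errors later; decrease --mem-fraction-static.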
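The three ranges above can be written as a small helper. This is only a sketch: the function name is hypothetical, and treating everything above 8 GB as "too high" is my simplification, since the source only gives 10-20 GB as an example of that case.

```python
def classify_available_gpu_mem(available_gb: float, low_gb: float = 5.0, high_gb: float = 8.0) -> str:
    """Map an available_gpu_mem reading onto the rule-of-thumb ranges."""
    if available_gb < low_gb:
        return "too low: decrease --mem-fraction-static"
    if available_gb > high_gb:
        # Assumption: anything above the upper bound counts as too high.
        return "too high: increase --mem-fraction-static"
    return "ok"


for reading in (3.2, 6.5, 13.5):
    print(reading, "->", classify_available_gpu_mem(reading))
```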

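Checking in practice

Check the available_gpu_mem value in the log printed just before the server is ready: [5]

[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB

In this example, 13.50 GB is above the 5-8 GB range, so --mem-fraction-static could be raised.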
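If you want to pull this value out of a log file automatically, a regular expression is enough. The sketch below parses the example log line quoted in the tuning guide; the helper name is hypothetical, and the pattern assumes the field is always written as `available_gpu_mem=<number> GB`, which I have only confirmed against this one sample.

```python
import re
from typing import Optional

# Sample line copied from the SGLang tuning guide quoted under Citations.
LOG_LINE = (
    "[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, "
    "max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, "
    "available_gpu_mem=13.50 GB"
)

# Assumed field format: available_gpu_mem=<decimal number> GB
_AVAILABLE_RE = re.compile(r"available_gpu_mem=([0-9]+(?:\.[0-9]+)?) GB")


def parse_available_gpu_mem(line: str) -> Optional[float]:
    """Return the available_gpu_mem value in GB, or None if the field is absent."""
    match = _AVAILABLE_RE.search(line)
    return float(match.group(1)) if match else None


print(parse_available_gpu_mem(LOG_LINE))
```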

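Tuning advice

If the available memory is outside the 5-8 GB range:

Too high: raise --mem-fraction-static gradually, for example in steps of 0.01. [6]
Too low: lower --mem-fraction-static, for example to 0.8 or 0.7. [7]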
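The adjustment rule can be sketched as a function that proposes the next value to try after each server start. The function name and the 5-8 GB defaults are illustrative. Stepping down by the same 0.01 when memory is too low is my assumption; the source instead suggests dropping directly to a value such as 0.8 or 0.7.

```python
def next_mem_fraction_static(
    current: float,
    available_gb: float,
    step: float = 0.01,
    low_gb: float = 5.0,
    high_gb: float = 8.0,
) -> float:
    """Propose the next --mem-fraction-static to try, given an available_gpu_mem reading."""
    if available_gb > high_gb:
        candidate = current + step   # too much headroom: give more to the KV cache
    elif available_gb < low_gb:
        candidate = current - step   # assumption: step down gradually instead of jumping to 0.8
    else:
        candidate = current          # already inside the recommended range
    # Keep the proposal a valid fraction and avoid float noise such as 0.8900000001.
    return round(min(max(candidate, 0.0), 1.0), 3)


print(next_mem_fraction_static(0.88, available_gb=13.5))
```

Each proposal still needs a restart and a fresh look at the log, because the available memory is only known once the server has allocated its pools.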

Notes

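5-8 GB is an empirical value based on the activation memory needs of typical workloads. Actual needs vary with model size, batch size, and parallel configuration, but this range gives a good balance in most scenarios.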


Citations

File: docs/advanced_features/hyperparameter_tuning.md (L29-37)

SGLang allocates memory as follows:

Total memory usage = model weights + KV cache pool + CUDA graph buffers + activations

The `--mem-fraction-static` parameter determines how much memory is allocated to the first two components:

mem_fraction_static = (model weights + KV cache pool) / GPU memory capacity

To support higher concurrency, you should maximize the KV cache pool capacity by setting `--mem-fraction-static` as high as possible while still reserving enough memory for activations and CUDA graph buffers.

File: docs/advanced_features/hyperparameter_tuning.md (L40-50)

As a rule of thumb, reserving 5–8 GB of memory for activations is typically sufficient. You can check this by inspecting the logs just before the server is ready.
Look for log entries like this:

[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB


Check the `available_gpu_mem` value.
- If it is between 5–8 GB, the setting is good.
- If it is too high (e.g., 10 - 20 GB), increase `--mem-fraction-static` to allocate more memory to the KV cache.
- If it is too low, you risk out-of-memory (OOM) errors later, so decrease `--mem-fraction-static`.

File: docs/advanced_features/hyperparameter_tuning.md (L52-53)

Another straightforward approach is to increase `--mem-fraction-static` in increments of 0.01 until you encounter OOM errors for your workloads.

File: docs/advanced_features/hyperparameter_tuning.md (L58-61)

- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.

File: python/sglang/srt/server_args.py (L812-854)

        if self.mem_fraction_static is None:
            # Constant meta data (e.g., from attention backend)
            reserved_mem = 512
            # For activation during large prefill
            if self.chunked_prefill_size > 0:
                reserved_mem += max(self.chunked_prefill_size, 2048) * 1.5
            else:
                reserved_mem += max(self.max_prefill_tokens, 2048) * 1.5
            # For cuda graphs
            reserved_mem += self.cuda_graph_max_bs * 2
            # Some adjustments for large parallel size
            reserved_mem += self.tp_size * self.pp_size / 8 * 1024

            if self.enable_dp_attention:
                # DP attention needs more padding for some operations
                reserved_mem += self.cuda_graph_max_bs * self.dp_size * 3

                # DP attention uses much more memory for large cuda graph max bs,
                # likely due to some inefficiencies in torch allocator or our implementation.
                # So we need to reserve more memory.
                if self.cuda_graph_max_bs > 300:
                    reserved_mem += self.cuda_graph_max_bs * self.dp_size * 1.5

            if gpu_mem is not None and gpu_mem > 60 * 1024:
                reserved_mem = max(reserved_mem, 10 * 1024)

            if self.speculative_algorithm is not None:
                if self.speculative_algorithm == "STANDALONE":
                    # standalonedraft model and cuda graphs
                    reserved_mem += 6 * 1024
                elif self.speculative_algorithm != "NGRAM":
                    # eagle draft models and cuda graphs
                    reserved_mem += 2 * 1024

            # For piecewise cuda graphs
            if self.enable_piecewise_cuda_graph:
                reserved_mem += self.piecewise_cuda_graph_max_tokens // 4

            self.mem_fraction_static = (
                round((gpu_mem - reserved_mem) / gpu_mem, 3)
                if gpu_mem is not None
                else 0.88
            )

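Good question. The recommended 5-8 GB range indeed does not depend directly on model size, because this reserved memory is used mainly for dynamic computation rather than for storing model weights. [1]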

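Why it is independent of model size

The 5-8 GB reservation covers activation memory and CUDA graph buffers, which are used mainly for:

Forward-pass activations: these scale with batch size and sequence length rather than with parameter count [2]
CUDA graph buffers: used for inference optimization; their size is determined by cuda_graph_max_bs [3]
Temporary workspace: dynamic operations such as attention computation and MoE expert routing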

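Verifying the calculation logic

In server_args.py, the reserved-memory calculation indeed takes no model-size input: [4]

# Base reservation (fixed)
reserved_mem = 512
# Activation memory (based on chunked_prefill_size, not model size)
reserved_mem += max(self.chunked_prefill_size, 2048) * 1.5
# CUDA graph memory (based on cuda_graph_max_bs, not model size)
reserved_mem += self.cuda_graph_max_bs * 2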

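What model size does affect

Model size mainly affects:

Static memory allocation: controlled by --mem-fraction-static
KV cache capacity: larger models generally need more KV cache for the same number of tokens

But dynamic memory needs (activations, temporary buffers) depend mainly on the batch and sequence configuration at inference time, which is why the 5-8 GB reservation applies broadly. [5]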
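To see why KV cache capacity depends on the model, the sketch below uses the standard per-token estimate for multi-head or grouped-query attention with an unquantized cache: one key and one value vector per layer and per KV head. This formula is general background rather than something taken from SGLang's source, attention variants that compress the cache will differ, and both model configurations are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size; the leading 2 counts the key and the value tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_tokens_in_pool(kv_pool_gb: float, bytes_per_token: int) -> int:
    """How many tokens fit into a KV cache pool of the given size (GB as 1024**3 bytes)."""
    return int(kv_pool_gb * 1024**3) // bytes_per_token


# Two hypothetical configurations that differ only in depth (fp16 cache, 2 bytes per element).
small = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
large = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

for name, per_token in (("small", small), ("large", large)):
    print(name, per_token, "bytes/token,", max_tokens_in_pool(40.0, per_token), "tokens in a 40 GB pool")
```

The per-token cost follows the number of layers, KV heads, and head dimension rather than the parameter count itself, so a larger model with fewer KV heads can need less cache per token than a smaller one.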

Notes

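The benefit of this design is simpler tuning: whether you run a 7B or a 70B model, the reservation strategy for dynamic memory is the same, and you only adjust --mem-fraction-static to fit each model's static memory needs.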


Citations

File: docs/advanced_features/hyperparameter_tuning.md (L31-37)

Total memory usage = model weights + KV cache pool + CUDA graph buffers + activations

The `--mem-fraction-static` parameter determines how much memory is allocated to the first two components:

mem_fraction_static = (model weights + KV cache pool) / GPU memory capacity

To support higher concurrency, you should maximize the KV cache pool capacity by setting `--mem-fraction-static` as high as possible while still reserving enough memory for activations and CUDA graph buffers.

File: docs/advanced_features/hyperparameter_tuning.md (L40-50)

As a rule of thumb, reserving 5–8 GB of memory for activations is typically sufficient. You can check this by inspecting the logs just before the server is ready.
Look for log entries like this:

[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB


Check the `available_gpu_mem` value.
- If it is between 5–8 GB, the setting is good.
- If it is too high (e.g., 10 - 20 GB), increase `--mem-fraction-static` to allocate more memory to the KV cache.
- If it is too low, you risk out-of-memory (OOM) errors later, so decrease `--mem-fraction-static`.

File: python/sglang/srt/server_args.py (L812-823)

        if self.mem_fraction_static is None:
            # Constant meta data (e.g., from attention backend)
            reserved_mem = 512
            # For activation during large prefill
            if self.chunked_prefill_size > 0:
                reserved_mem += max(self.chunked_prefill_size, 2048) * 1.5
            else:
                reserved_mem += max(self.max_prefill_tokens, 2048) * 1.5
            # For cuda graphs
            reserved_mem += self.cuda_graph_max_bs * 2
            # Some adjustments for large parallel size
            reserved_mem += self.tp_size * self.pp_size / 8 * 1024
