1. sglang environment setup

雷动工作室 · Published 2026-06-15 09:10:00

Preface

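A recent project needed an inference serving framework. I tried ollama, vllm and sglang in turn and settled on sglang, so this post records the setup process.
Official documentation: https://docs.sglang.io/ . Environment: Ubuntu 22.04, CUDA 12.6.
There is also an interview about sglang that is worth watching: 朱邦华 (Banghua Zhu): SGLang, reinforcement learning, the NVIDIA acquisition, a second startup, Tsinghua, Berkeley, LMSYS, Chatbot Arena, and being good at giving things up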

Setup steps

Create and activate a conda environment
conda create -n sglang python=3.10 -y
conda activate sglang
Upgrade pip
pip install --upgrade pip
Install sglang (the version that ended up installed is 0.5.12.post1)
pip install "sglang[all]" --index-url https://download.pytorch.org/whl/cu124 --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple
Launch command
python3 -m sglang.launch_server --model-path /workspace/models/Qwen2.5-7B-Instruct/ --host 0.0.0.0 --port 30000
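Since the first startup error below comes down to a dependency version, it can help to record what pip actually resolved before launching. This is a minimal sketch using only the standard library. The package list is my own choice for illustration, not something the sglang documentation prescribes.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package list chosen for illustration only.
report = installed_versions(["sglang", "torch", "transformers", "kernels"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated conda environment so that it inspects the same site-packages the server will use.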
Startup errors

Error 1

(sglang) t@user:~$ python3 -m sglang.launch_server --model-path /workspace/models/Qwen2.5-7B-Instruct/ --host 0.0.0.0 --port 30000
Traceback (most recent call last):
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/__init__.py", line 31, in <module>
    _apply_hf_patches()
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/utils/hf_transformers_patches.py", line 59, in apply_all
    _patch_removed_symbols()
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/utils/hf_transformers_patches.py", line 235, in _patch_removed_symbols
    from transformers.models.llama import modeling_llama
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 2306, in __getattr__
    value = self._get_module(name)
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 2446, in _get_module
    raise e
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 2444, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 25, in <module>
    from ...activations import ACT2FN
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/activations.py", line 22, in <module>
    from .integrations.hub_kernels import use_kernel_forward_from_hub
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/transformers/integrations/hub_kernels.py", line 89, in <module>
    "cuda": LayerRepository(
  File "/home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/kernels/layer/layer.py", line 76, in __init__
    raise ValueError("Either a revision or a version must be specified.")
ValueError: Either a revision or a version must be specified.

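Fix: pin the kernels package to the 0.11.x series
pip install "kernels>=0.11.0,<0.12.0" --no-deps --force-reinstall
Newer versions of transformers introduced a feature called use_kernel_forward_from_hub, which automatically pulls optimized CUDA kernels from the Hugging Face Hub.
The problem is that it depends on a lower-level Python library called kernels. A recent release of kernels tightened the argument validation in LayerRepository.__init__: a revision (branch name) or a version (version number) must now be passed explicitly. The stable transformers code still calls it the old way and passes neither, which raises the ValueError shown above.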
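To confirm that the pin took effect, the installed kernels version can be compared against the range used in the fix. The sketch below is deliberately simplified: the helper names are hypothetical, and the parser only looks at the leading numeric components, ignoring pre-release and dev suffixes (the packaging library would be the robust choice).

```python
from importlib import metadata


def parse_release(version):
    """Turn '0.11.3' or '0.11.3.post1' into a 3-tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)  # treat '0.11' the same as '0.11.0'
    return tuple(parts[:3])


def in_pinned_range(version, low=(0, 11, 0), high=(0, 12, 0)):
    """True when low <= version < high, mirroring 'kernels>=0.11.0,<0.12.0'."""
    return low <= parse_release(version) < high


try:
    found = metadata.version("kernels")
    status = "ok" if in_pinned_range(found) else "outside >=0.11.0,<0.12.0"
    print(f"kernels {found}: {status}")
except metadata.PackageNotFoundError:
    print("kernels is not installed")
```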
Error 2

Capturing batches (bs=32 avail_mem=5.87 GB):   0%|                                                                                                                           | 0/8 [01:21<?, ?it/s]
[2026-06-12 11:04:11] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/cuda_graph_runner.py", line 736, in __init__
    self.capture()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/cuda_graph_runner.py", line 918, in capture
    _capture_one_stream()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/cuda_graph_runner.py", line 906, in _capture_one_stream
    ) = self.capture_one_batch_size(bs, forward, stream_idx)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/cuda_graph_runner.py", line 1187, in capture_one_batch_size
    run_once()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/cuda_graph_runner.py", line 1165, in run_once
    logits_output_or_pp_proxy_tensors = forward(
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/models/qwen2.py", line 486, in forward
    hidden_states = self.model(
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/models/qwen2.py", line 367, in forward
    hidden_states, residual = layer(
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/models/qwen2.py", line 261, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/models/qwen2.py", line 103, in forward
    x = self.act_fn(gate_up)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/layers/utils/multi_platform.py", line 83, in forward
    return self._forward_method(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/layers/activation.py", line 94, in forward_cuda
    silu_and_mul(x, out)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/activation.py", line 109, in silu_and_mul
    return run_activation("silu", input, out, expert_ids, expert_step)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/activation.py", line 97, in run_activation
    _run_activation_inplace(op_name, input, out)
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/torch/_ops.py", line 1269, in __call__
    return self._op(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/activation.py", line 57, in _run_activation_inplace
    module = _jit_activation_module(input.dtype)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/utils.py", line 57, in wrapper
    result_map[key] = fn(*args, **kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/activation.py", line 34, in _jit_activation_module
    return load_jit(
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/utils.py", line 208, in load_jit
    return load_inline(
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/cpp/extension.py", line 1035, in load_inline
    build_inline(
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/cpp/extension.py", line 877, in build_inline
    return _build_impl(
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/cpp/extension.py", line 672, in _build_impl
    build_ninja(str(build_dir))
  File "/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/cpp/extension.py", line 542, in build_ninja
    raise RuntimeError("\n".join(msg))
RuntimeError: ninja exited with status 1
stdout:
[1/2] /usr/local/cuda-12.6/bin/nvcc  --generate-dependencies-with-compile --dependency-output cuda_0.o.d -Xcompiler -fPIC -std=c++17 -O2 -gencode=arch=compute_89,code=sm_89 -DSGL_CUDA_ARCH=890 -std=c++20 -O3 --expt-relaxed-constexpr --use_fast_math -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/include -c /home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu -o cuda_0.o
FAILED: [code=1] cuda_0.o 
/usr/local/cuda-12.6/bin/nvcc  --generate-dependencies-with-compile --dependency-output cuda_0.o.d -Xcompiler -fPIC -std=c++17 -O2 -gencode=arch=compute_89,code=sm_89 -DSGL_CUDA_ARCH=890 -std=c++20 -O3 --expt-relaxed-constexpr --use_fast_math -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/include -c /home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu -o cuda_0.o
nvcc warning : incompatible redefinition for option 'std', the last value of this option was used
nvcc warning : incompatible redefinition for option 'optimize', the last value of this option was used
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh(146): warning #2361-D: invalid narrowing conversion from "signed long" to "unsigned int"
          .hidden_dim = hidden_size,
                        ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh(146): warning #2361-D: invalid narrowing conversion from "signed long" to "unsigned int"
          .hidden_dim = hidden_size,
                        ^
          detected during instantiation of "void <unnamed>::ActivationKernel<T, kUsePDL>::run_activation(tvm::ffi::TensorView, tvm::ffi::TensorView, std::string) [with T=bf16_t, kUsePDL=false]" at line 8 of /home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In static member function ‘static void _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::launch(const tvm::ffi::TensorView&, const tvm::ffi::TensorView&, const std::string&, const int32_t*, uint32_t)’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:153:42: error: no matching function for call to ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel<true>(const std::string&)’
  153 |       const auto kernel = select_kernel<true>(type);
      |                     ~~~~~~~~~~~~~~~~~~~~~^~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note: candidate: ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&)’
   95 |   static auto select_kernel(const std::string& type)
      | ^ ~~~~~~~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note:   template argument deduction/substitution failed:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In substitution of ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&) [with bool kFilterExpert = <missing>; T = true; bool kUsePDL = <missing>]’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:153:42:   required from here
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: error: type/value mismatch at argument 1 in template parameter list for ‘template<class T, bool kUsePDL> template<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind kAct, bool kFilterExpert> constexpr const auto _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::activation_kernel<kAct, kFilterExpert>’
   95 |   static auto select_kernel(const std::string& type)
      |                                                    ^                                                       
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: note:   expected a type, got ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU’
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:156:43: error: no matching function for call to ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel<false>(const std::string&)’
  156 |       const auto kernel = select_kernel<false>(type);
      |                     ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note: candidate: ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&)’
   95 |   static auto select_kernel(const std::string& type)
      | ^ ~~~~~~~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note:   template argument deduction/substitution failed:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In substitution of ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&) [with bool kFilterExpert = <missing>; T = false; bool kUsePDL = <missing>]’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:156:43:   required from here
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: error: type/value mismatch at argument 1 in template parameter list for ‘template<class T, bool kUsePDL> template<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind kAct, bool kFilterExpert> constexpr const auto _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::activation_kernel<kAct, kFilterExpert>’
   95 |   static auto select_kernel(const std::string& type)
      |                                                    ^                                                       
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: note:   expected a type, got ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU’
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In instantiation of ‘static void _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::launch(const tvm::ffi::TensorView&, const tvm::ffi::TensorView&, const std::string&, const int32_t*, uint32_t) [with T = __nv_bfloat16; bool kUsePDL = false; std::string = std::__cxx11::basic_string<char>; int32_t = int; uint32_t = unsigned int]’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:162:9:   required from ‘static void _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::run_activation(tvm::ffi::TensorView, tvm::ffi::TensorView, std::string) [with T = __nv_bfloat16; bool kUsePDL = false; std::string = std::__cxx11::basic_string<char>]’
/home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu:8:427:   required from here
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:143:102: warning: narrowing conversion of ‘(long int)hidden_size’ from ‘long int’ to ‘uint32_t’ {aka ‘unsigned int’} [-Wnarrowing]
  143 |     const auto params = ActivationParams{
      |                                                                                                      ^          
ninja: build stopped: subcommand failed.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/managers/scheduler.py", line 4025, in run_scheduler_process
    scheduler = Scheduler(
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/managers/scheduler.py", line 437, in __init__
    self.init_model_worker()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/managers/scheduler.py", line 718, in init_model_worker
    self.init_tp_model_worker()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/managers/scheduler.py", line 673, in init_tp_model_worker
    self.tp_worker = TpModelWorker(**worker_kwargs)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/managers/tp_worker.py", line 262, in __init__
    self._init_model_runner()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/managers/tp_worker.py", line 347, in _init_model_runner
    self._model_runner = ModelRunner(
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/model_runner.py", line 535, in __init__
    self.initialize(pre_model_load_memory)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/model_runner.py", line 791, in initialize
    self.init_device_graphs()
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/model_runner.py", line 2965, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
  File "/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/srt/model_executor/cuda_graph_runner.py", line 738, in __init__
    raise Exception(
Exception: Capture cuda graph failed: ninja exited with status 1
stdout:
[1/2] /usr/local/cuda-12.6/bin/nvcc  --generate-dependencies-with-compile --dependency-output cuda_0.o.d -Xcompiler -fPIC -std=c++17 -O2 -gencode=arch=compute_89,code=sm_89 -DSGL_CUDA_ARCH=890 -std=c++20 -O3 --expt-relaxed-constexpr --use_fast_math -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/include -c /home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu -o cuda_0.o
FAILED: [code=1] cuda_0.o 
/usr/local/cuda-12.6/bin/nvcc  --generate-dependencies-with-compile --dependency-output cuda_0.o.d -Xcompiler -fPIC -std=c++17 -O2 -gencode=arch=compute_89,code=sm_89 -DSGL_CUDA_ARCH=890 -std=c++20 -O3 --expt-relaxed-constexpr --use_fast_math -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/miniconda3/envs/sglang_src/lib/python3.10/site-packages/tvm_ffi/include -I/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/include -c /home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu -o cuda_0.o
nvcc warning : incompatible redefinition for option 'std', the last value of this option was used
nvcc warning : incompatible redefinition for option 'optimize', the last value of this option was used
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh(146): warning #2361-D: invalid narrowing conversion from "signed long" to "unsigned int"
          .hidden_dim = hidden_size,
                        ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh(146): warning #2361-D: invalid narrowing conversion from "signed long" to "unsigned int"
          .hidden_dim = hidden_size,
                        ^
          detected during instantiation of "void <unnamed>::ActivationKernel<T, kUsePDL>::run_activation(tvm::ffi::TensorView, tvm::ffi::TensorView, std::string) [with T=bf16_t, kUsePDL=false]" at line 8 of /home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In static member function ‘static void _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::launch(const tvm::ffi::TensorView&, const tvm::ffi::TensorView&, const std::string&, const int32_t*, uint32_t)’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:153:42: error: no matching function for call to ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel<true>(const std::string&)’
  153 |       const auto kernel = select_kernel<true>(type);
      |                     ~~~~~~~~~~~~~~~~~~~~~^~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note: candidate: ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&)’
   95 |   static auto select_kernel(const std::string& type)
      | ^ ~~~~~~~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note:   template argument deduction/substitution failed:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In substitution of ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&) [with bool kFilterExpert = <missing>; T = true; bool kUsePDL = <missing>]’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:153:42:   required from here
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: error: type/value mismatch at argument 1 in template parameter list for ‘template<class T, bool kUsePDL> template<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind kAct, bool kFilterExpert> constexpr const auto _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::activation_kernel<kAct, kFilterExpert>’
   95 |   static auto select_kernel(const std::string& type)
      |                                                    ^                                                       
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: note:   expected a type, got ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU’
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:156:43: error: no matching function for call to ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel<false>(const std::string&)’
  156 |       const auto kernel = select_kernel<false>(type);
      |                     ~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note: candidate: ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&)’
   95 |   static auto select_kernel(const std::string& type)
      | ^ ~~~~~~~~~~~
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:1: note:   template argument deduction/substitution failed:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In substitution of ‘template<class T, bool kUsePDL> template<bool kFilterExpert> static decltype (activation_kernel<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU, kFilterExpert>) _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::select_kernel(const std::string&) [with bool kFilterExpert = <missing>; T = false; bool kUsePDL = <missing>]’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:156:43:   required from here
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: error: type/value mismatch at argument 1 in template parameter list for ‘template<class T, bool kUsePDL> template<_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind kAct, bool kFilterExpert> constexpr const auto _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::activation_kernel<kAct, kFilterExpert>’
   95 |   static auto select_kernel(const std::string& type)
      |                                                    ^                                                       
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:95:52: note:   expected a type, got ‘_GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKind::kSiLU’
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh: In instantiation of ‘static void _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::launch(const tvm::ffi::TensorView&, const tvm::ffi::TensorView&, const std::string&, const int32_t*, uint32_t) [with T = __nv_bfloat16; bool kUsePDL = false; std::string = std::__cxx11::basic_string<char>; int32_t = int; uint32_t = unsigned int]’:
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:162:9:   required from ‘static void _GLOBAL__N__7f02c643_7_cuda_cu_fc7cb620::ActivationKernel<T, kUsePDL>::run_activation(tvm::ffi::TensorView, tvm::ffi::TensorView, std::string) [with T = __nv_bfloat16; bool kUsePDL = false; std::string = std::__cxx11::basic_string<char>]’
/home/t/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_false_3a2beb40cee02f8b/cuda.cu:8:427:   required from here
/home/t/lld/learn/sglang-0.5.12.post1/python/sglang/jit_kernel/csrc/elementwise/activation.cuh:143:102: warning: narrowing conversion of ‘(long int)hidden_size’ from ‘long int’ to ‘uint32_t’ {aka ‘unsigned int’} [-Wnarrowing]
  143 |     const auto params = ActivationParams{
      |                                                                                                      ^          
ninja: build stopped: subcommand failed.

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

Fix: the solution is in this sglang issue: https://github.com/sgl-project/sglang/issues/25682
In the file /home/t/miniconda3/envs/sglang/lib/python3.10/site-packages/sglang/jit_kernel/csrc/elementwise/activation.cuh, modify the select_kernel function. The original reads:

  template <bool kFilterExpert>
  static auto select_kernel(const std::string& type)
      -> decltype(activation_kernel<ActivationKind::kSiLU, kFilterExpert>) 
  {
    using namespace host;
    if (type == "silu") {
      return activation_kernel<ActivationKind::kSiLU, kFilterExpert>;
    } else if (type == "gelu") {
      return activation_kernel<ActivationKind::kGELU, kFilterExpert>;
    } else if (type == "gelu_tanh") {
      return activation_kernel<ActivationKind::kGELUTanh, kFilterExpert>;
    } else {
      Panic("unsupported activation type: ", type);
    }
    return nullptr;
  }

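Comment out two lines, the trailing return type and the final return nullptr, so that it reads as follows. Once the trailing return type is gone, the compiler deduces the return type from the return statements, so return nullptr has to go as well: it would deduce a different type (std::nullptr_t) and conflict with the other returns.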

  template <bool kFilterExpert>
  static auto select_kernel(const std::string& type)
      //-> decltype(activation_kernel<ActivationKind::kSiLU, kFilterExpert>) 
  {
    using namespace host;
    if (type == "silu") {
      return activation_kernel<ActivationKind::kSiLU, kFilterExpert>;
    } else if (type == "gelu") {
      return activation_kernel<ActivationKind::kGELU, kFilterExpert>;
    } else if (type == "gelu_tanh") {
      return activation_kernel<ActivationKind::kGELUTanh, kFilterExpert>;
    } else {
      Panic("unsupported activation type: ", type);
    }
    // return nullptr;
  }
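The same edit could be scripted rather than made by hand, which is handy if the package gets reinstalled. The following is a sketch only: patch_select_kernel and patch_file are hypothetical helper names, and the code assumes the file contains the two lines exactly as quoted in the original function above.

```python
from pathlib import Path

# The two lines to comment out, as quoted in the post
# (assumed to match the file contents exactly).
DECLTYPE_LINE = "-> decltype(activation_kernel<ActivationKind::kSiLU, kFilterExpert>)"
RETURN_LINE = "return nullptr;"


def patch_select_kernel(source):
    """Return source with the two lines inside select_kernel commented out."""
    lines = source.splitlines(keepends=True)
    pending = [DECLTYPE_LINE, RETURN_LINE]  # matched in this order
    inside = False  # becomes True once the select_kernel signature is seen
    for i, line in enumerate(lines):
        if "static auto select_kernel(" in line:
            inside = True
            continue
        if inside and pending and line.strip() == pending[0]:
            body = line.lstrip()
            indent = line[: len(line) - len(body)]
            lines[i] = f"{indent}// {body}"
            pending.pop(0)
    if pending:
        raise ValueError(f"line not found (already patched?): {pending[0]}")
    return "".join(lines)


def patch_file(path):
    """Patch activation.cuh in place, keeping a .bak copy next to it."""
    path = Path(path)
    original = path.read_text()
    patched = patch_select_kernel(original)  # raises before anything is written
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(patched)
```

Calling patch_file with the activation.cuh path given above applies the change. A second call raises ValueError instead of commenting the lines out twice.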

With the fix in place, start sglang again
python3 -m sglang.launch_server --model-path /workspace/models/Qwen2.5-7B-Instruct/ --host 0.0.0.0 --port 30000

[2026-06-12 12:15:57] Using default HuggingFace chat template with detected content format: string
[2026-06-12 12:16:10] Init torch distributed begin.
[2026-06-12 12:16:11] Init torch distributed ends. elapsed=0.22 s, mem usage=0.06 GB
[2026-06-12 12:16:12] Load weight begin. avail mem=46.20 GB
Multi-thread loading shards: 100% Completed | 4/4 [00:05<00:00,  1.30s/it]
[2026-06-12 12:16:18] Load weight end. elapsed=5.54 s, type=Qwen2ForCausalLM, avail mem=31.91 GB, mem usage=14.30 GB.
[2026-06-12 12:16:18] Using KV cache dtype: torch.bfloat16
[2026-06-12 12:16:18] KV Cache is allocated. #tokens: 470181, K size: 12.56 GB, V size: 12.56 GB
[2026-06-12 12:16:18] Memory pool end. avail mem=6.23 GB
[2026-06-12 12:16:18] Capture cuda graph begin. This can take up to several minutes. avail mem=5.67 GB
[2026-06-12 12:16:18] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32]
Capturing batches (bs=1 avail_mem=5.52 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:43<00:00,  5.41s/it]
[2026-06-12 12:17:02] Capture cuda graph end. Time elapsed: 44.02 s. mem usage=0.18 GB. avail mem=5.49 GB.
[2026-06-12 12:17:02] Capture piecewise CUDA graph begin. avail mem=5.49 GB
[2026-06-12 12:17:02] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096]
Compiling num tokens (num_tokens=4096):   0%|                                                                                                                                       | 0/50 [00:00<?, ?it/s][2026-06-12 12:17:08] Compiling a graph for dynamic shape takes 0.74 s
[2026-06-12 12:17:14] Compiling a graph for dynamic shape takes 0.78 s
Compiling num tokens (num_tokens=4): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:17<00:00,  2.87it/s]
Capturing num tokens (num_tokens=4 avail_mem=4.73 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00,  7.30it/s]
[2026-06-12 12:17:27] Capture piecewise CUDA graph end. Time elapsed: 24.80 s. mem usage=0.76 GB. avail mem=4.73 GB.
[2026-06-12 12:17:28] max_total_num_tokens=470181, chunked_prefill_size=4096, max_prefill_tokens=16384, max_running_requests=4096, context_len=32768, available_gpu_mem=4.73 GB
[2026-06-12 12:17:29] INFO:     Started server process [1009649]
[2026-06-12 12:17:29] INFO:     Waiting for application startup.
[2026-06-12 12:17:29] Using default chat sampling params from model generation config: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2026-06-12 12:17:29] INFO:     Application startup complete.
[2026-06-12 12:17:29] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2026-06-12 12:17:30] INFO:     127.0.0.1:32974 - "GET /model_info HTTP/1.1" 200 OK
[2026-06-12 12:17:30] Prefill batch, #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #pending-token: 0, cuda graph: True, input throughput (token/s): 0.07
[2026-06-12 12:17:31] INFO:     127.0.0.1:32986 - "POST /generate HTTP/1.1" 200 OK
[2026-06-12 12:17:31] The server is fired up and ready to roll!
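In this log roughly a minute and a half passes between the first line and "ready to roll", most of it spent capturing CUDA graphs, so a client script started right after the server may want to wait. The log also shows the server answering GET /model_info once it is up. A minimal polling sketch built on that observation follows. The function names, timeouts and intervals are my own assumptions.

```python
import time
import urllib.request


def model_info_ok(base_url, timeout_s=2.0):
    """Return True if GET {base_url}/model_info answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/model_info", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error status, ...
        return False


def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0):
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage against the server started above:
#   ready = wait_until_ready(lambda: model_info_ok("http://127.0.0.1:30000"))
```

Passing the probe as a callable keeps the waiting loop independent of the endpoint, so a different readiness check can be swapped in without touching the loop.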

Testing

Official test code

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="/workspace/models/Qwen2.5-7B-Instruct/",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print(response.choices[0].message.content)
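For longer answers it is often nicer to print tokens as they arrive. The sketch below is a streaming variant of the request above, assuming the server's OpenAI-compatible endpoint accepts stream=True the way the OpenAI API does. stream_reply is a hypothetical helper name, and it takes the client as a parameter so the same openai.Client object from the official example can be reused.

```python
def stream_reply(client, model, prompt, max_tokens=64):
    """Print a chat completion as it is generated and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=max_tokens,
        stream=True,  # ask the server to send incremental chunks
    )
    pieces = []
    for chunk in stream:
        # Some chunks carry no text, e.g. a role-only first chunk.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            pieces.append(delta)
    print()
    return "".join(pieces)


# Usage with the client defined in the official example:
#   stream_reply(client, "/workspace/models/Qwen2.5-7B-Instruct/",
#                "List 3 countries and their capitals.")
```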