NVIDIA NeMo-Skills: A One-Stop Platform for Improving LLM Capabilities, from Prototype to Cluster

姬鸿桢

姬鸿桢 · Published 2025-11-19 02:35:54

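Improving the capabilities of a large language model (LLM) is a multi-stage effort that typically spans synthetic data generation (SDG), model training (supervised fine-tuning, SFT, or reinforcement learning, RL), and evaluation. Each stage tends to rely on a different library, and these libraries are complex to configure and hard to use together. Data generation may call for NVIDIA TensorRT-LLM or vLLM, for example, while training relies on NeMo or verl. In practice this means switching between scripts and container environments to convert Hugging Face checkpoints, generate data at scale, reformat that data for NeMo training, and so on. To address this, NVIDIA built NeMo-Skills, which provides high-level abstractions that connect these frameworks and manages the full workflow, from quick local prototyping to large-scale job scheduling on a Slurm cluster. This article walks through a simplified version of the pipeline the NVIDIA team used to win the AIMO2 Kaggle competition: starting from a base model with limited mathematical reasoning ability and strengthening it step by step through a series of NeMo-Skills jobs.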

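Setting up local and Slurm environments

NeMo-Skills uses Docker containers to manage complex job pipelines. Whether you run locally or on a Slurm cluster (which must support the NVIDIA/pyxis plugin), install the NVIDIA Container Toolkit first. The recommended practice is to configure NeMo-Skills on a local workstation and connect to the Slurm cluster over SSH; the platform then handles code upload and job scheduling automatically. Run the following commands to initialize: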

pip install git+https://github.com/NVIDIA/NeMo-Skills.git
ns setup

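During setup, when prompted to add mounts, define a folder to be mounted as /workspace; all subsequent commands in this article operate under that path (see the NeMo-Skills documentation for details). The example commands use --cluster=local; on a Slurm cluster, replace it with --cluster=slurm (or whatever name you gave the cluster during setup). On Slurm, every command returns immediately and the job is placed in the cluster queue. The commands also use Weights & Biases (W&B) to log evaluation results and model outputs; to disable this, remove all W&B-related arguments.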
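If you keep the example commands in a script, switching clusters and dropping W&B flags can be done mechanically. The sketch below is a hypothetical helper (retarget is not part of NeMo-Skills) that rewrites an argument list; it only handles CLI flags that start with --cluster= or --wandb_, which is an assumption based on the flags that appear in this article.

```python
def retarget(args, cluster, keep_wandb=True):
    """Return a copy of an ns argument list aimed at another cluster.

    Hypothetical helper for illustration; not part of NeMo-Skills.
    """
    result = []
    for arg in args:
        if arg.startswith("--cluster="):
            # Point the command at the requested cluster config.
            arg = f"--cluster={cluster}"
        elif not keep_wandb and arg.startswith("--wandb_"):
            # Drop W&B-related flags when logging is disabled.
            continue
        result.append(arg)
    return result


local_cmd = [
    "ns", "summarize_results", "--cluster=local",
    "/workspace/evals/baseline", "--wandb_name=baseline-evals",
]
slurm_cmd = retarget(local_cmd, "slurm", keep_wandb=False)
print(" ".join(slurm_cmd))
```

The Python API calls later in the article pass W&B options as keyword arguments (log_samples, wandb_group), so those would need to be removed by hand.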

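Establishing a baseline

Before improving the model, first establish a performance baseline. This tutorial uses Qwen2.5 14B Instruct as the model to improve, evaluates its mathematical reasoning on AIME24 and AIME25, and uses vLLM for inference.

Preparing the model and data

First download the base model and prepare the evaluation datasets:

# Download the Qwen2.5 14B Instruct model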
ns run_cmd --expname=download-14b --log_dir=/workspace/Qwen2.5-14B-Instruct --cluster=local \
  huggingface-cli download Qwen/Qwen2.5-14B-Instruct --local-dir /workspace/Qwen2.5-14B-Instruct

# Prepare the AIME24/25 benchmark datasets
ns prepare_data aime24 aime25

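Running the baseline evaluation

The following command launches the evaluation. The platform generates 8 samples per problem and computes the pass-rate metrics: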

ns eval \
  --cluster=local \
  --expname=baseline-eval \
  --run_after=download-14b \
  --model=/workspace/Qwen2.5-14B-Instruct \
  --server_type=vllm \
  --server_gpus=8 \
  --benchmarks=aime24:8,aime25:8 \
  --output_dir=/workspace/evals/baseline

# Summarize the evaluation results
ns summarize_results --cluster=local /workspace/evals/baseline --wandb_name=baseline-evals

Example baseline results (LLM generation is stochastic, so your exact numbers may differ slightly):

--------------------------------- aime24 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 829        | 11.67%           | 0.00%
majority@8      | 30          | 829        | 13.33%           | 0.00%
pass@8          | 30          | 829        | 33.33%           | 0.00%

--------------------------------- aime25 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 834        | 11.67%           | 0.42%
majority@8      | 30          | 834        | 20.00%           | 0.00%
pass@8          | 30          | 834        | 26.67%           | 0.00%
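To make the three evaluation modes concrete, here is a minimal sketch of how such numbers can be computed from per-sample results. It reflects the usual reading of these metrics (pass@1[k] as the average single-sample accuracy over k samples, majority@k as the accuracy of the most frequent answer, pass@k as "at least one sample is correct") and is not taken from the NeMo-Skills source; the data layout and function name are assumptions.

```python
from collections import Counter


def summarize(problems):
    """Compute pass@1[k], majority@k and pass@k as percentages.

    `problems` has one entry per problem; each entry is a list of
    (predicted_answer, is_correct) pairs, one pair per sampled generation.
    """
    n = len(problems)
    pass_at_1 = 0.0
    majority = 0
    pass_at_k = 0
    for samples in problems:
        correct_flags = [is_correct for _, is_correct in samples]
        # pass@1[k]: fraction of correct samples, averaged over problems.
        pass_at_1 += sum(correct_flags) / len(samples)
        # pass@k: the problem counts if any sample is correct.
        pass_at_k += any(correct_flags)
        # majority@k: the most frequent non-empty answer must be correct.
        answers = [answer for answer, _ in samples if answer is not None]
        if answers:
            top_answer, _ = Counter(answers).most_common(1)[0]
            majority += any(a == top_answer and ok for a, ok in samples)
    return {
        "pass@1[k]": 100 * pass_at_1 / n,
        "majority@k": 100 * majority / n,
        "pass@k": 100 * pass_at_k / n,
    }


# Two toy problems with four samples each (made-up data).
toy = [
    [("42", True), ("42", True), ("7", False), ("42", True)],
    [("3", False), ("3", False), ("5", True), (None, False)],
]
print(summarize(toy))
```

Under this reading, 30 problems with 8 samples each give 240 generations, so a pass@1[8] of 11.67% corresponds to 28 correct generations and a no_answer rate of 0.42% to a single generation.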

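Synthetic data generation (SDG)

To improve the model's mathematical reasoning, we need high-quality synthetic training data. This recipe follows the OpenMathReasoning methodology: starting from a small set of AoPS forum discussions, Qwen2.5 14B Instruct extracts the math problems, and QwQ 32B then generates a detailed reasoning trace for each one. This simplified pipeline omits steps such as ground-truth answer extraction and correctness filtering, but it is still enough to teach the 14B model long-form reasoning and to improve its benchmark results noticeably.

Preparing data and tools

First download the data-processing scripts, the prompt template, and the raw forum data: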

ns run_cmd --expname=prepare-data --log_dir=/workspace/prepare-data --cluster=local \
  'cd /workspace && \
   export DOWNLOAD_PREFIX=https://raw.githubusercontent.com/NVIDIA/NeMo-Skills/refs/heads/main/recipes/openmathreasoning && \
   wget $DOWNLOAD_PREFIX/scripts/prepare_raw_data.py && \
   wget $DOWNLOAD_PREFIX/prompts/extract-problems.yaml && \
   wget $DOWNLOAD_PREFIX/scripts/postprocess_problem_extraction.py && \
   python prepare_raw_data.py && \
   head -n 1000 raw_aops_data.jsonl > data.jsonl'

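The fields in data.jsonl are used to fill the extract-problems.yaml prompt template, and the resulting prompt is passed to the LLM for problem extraction (the prompt format is described in the NeMo-Skills prompt documentation).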
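The idea of filling a template from JSONL fields can be sketched in a few lines of plain Python. This is an illustration only: the template text and the forum_post field name are hypothetical stand-ins, since the real wording lives in extract-problems.yaml and the real field names in data.jsonl, and NeMo-Skills performs this step internally.

```python
import json

# Hypothetical template; the real text lives in extract-problems.yaml.
TEMPLATE = (
    "Extract every math problem from the forum post below.\n\n"
    "Post:\n{forum_post}"
)


def build_prompts(jsonl_lines, template=TEMPLATE):
    """Fill the template once per JSON line, using the record's fields."""
    prompts = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Placeholders in the template are looked up by field name.
        prompts.append(template.format(**record))
    return prompts


sample_lines = ['{"forum_post": "Find all primes p such that p + 2 is prime."}']
prompts = build_prompts(sample_lines)
print(prompts[0])
```

Every placeholder in the template must match a field present in each record; a missing field raises KeyError here, which is a useful early check before launching a large generation job.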

Generation with the Python API

The NeMo-Skills Python API drives the data generation end to end:

# run_sdg.py
from nemo_skills.pipeline.cli import generate, wrap_arguments

cluster = "local"
num_gpus = 8
postprocess_cmd = (
  f"python /workspace/postprocess_problem_extraction.py "
  f"/workspace/sdg/problems/output.jsonl "
  f"/workspace/sdg/extracted-problems.jsonl "
)

generate(
  ctx=wrap_arguments(
    f"++prompt_config=/workspace/extract-problems.yaml "
    f"++prompt_template=qwen-instruct "
  ),
  cluster=cluster,
  input_file="/workspace/data.jsonl",
  output_dir="/workspace/sdg/problems",
  postprocess_cmd=postprocess_cmd,
  expname="problem-extraction",
  run_after=["prepare-data", "download-14b"],
  model="/workspace/Qwen2.5-14B-Instruct",
  server_type="vllm",
  server_gpus=num_gpus,
  # W&B logging (remove these arguments to disable)
  log_samples=True,
  wandb_group="sdg",
)

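The results are saved to sdg/extracted-problems.jsonl, with a new extracted_problem field added. The next step uses QwQ 32B to generate detailed solutions for these problems:

# Download the QwQ 32B model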
ns run_cmd --expname=download-qwq --log_dir=/workspace/QwQ-32B --cluster=local \
  huggingface-cli download Qwen/QwQ-32B --local-dir /workspace/QwQ-32B

# Convert to TensorRT-LLM format (speeds up long-generation inference)
ns convert \
  --cluster=local \
  --expname=convert-qwq-trtllm \
  --run_after=download-qwq \
  --input_model=/workspace/QwQ-32B \
  --output_model=/workspace/qwq32b-trtllm \
  --convert_from=hf \
  --convert_to=trtllm \
  --num_gpus=8 \
  --model_type=qwen \
  --hf_model_name=Qwen/QwQ-32B \
  --max_seq_len 10000

Append the following code to run_sdg.py to launch solution generation:

generate(
  ctx=wrap_arguments(
    f"++prompt_config=generic/math "
    f"++inference.temperature=0.6 "
    f"++inference.tokens_to_generate=8192 "
    f"++prompt_template=qwen-instruct "
  ),
  cluster=cluster,
  input_file="/workspace/sdg/extracted-problems.jsonl",
  output_dir="/workspace/sdg/solutions",
  expname="solution-generation",
  run_after=["problem-extraction", "convert-qwq-trtllm"],
  model="/workspace/qwq32b-trtllm",
  server_type="trtllm",
  server_gpus=num_gpus,
  log_samples=True,
  wandb_group="sdg",
)
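The expname and run_after arguments used so far form a dependency graph between jobs. As a rough illustration of the ordering this implies (not of how NeMo-Skills schedules jobs internally), the sketch below feeds the job names from this article into Python's standard graphlib module, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each job maps to the jobs listed in its run_after argument.
run_after = {
    "download-14b": [],
    "baseline-eval": ["download-14b"],
    "prepare-data": [],
    "problem-extraction": ["prepare-data", "download-14b"],
    "download-qwq": [],
    "convert-qwq-trtllm": ["download-qwq"],
    "solution-generation": ["problem-extraction", "convert-qwq-trtllm"],
}

# static_order() yields one valid execution order; jobs with no mutual
# dependency may appear in either order.
order = list(TopologicalSorter(run_after).static_order())
print(order)
```

Jobs without a path between them, such as download-14b and download-qwq, are free to run concurrently, which is why the whole script can be submitted at once.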

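For large generation jobs, use the num_chunks=N argument to parallelize across multiple nodes (see the NeMo-Skills generation documentation). The generated results can be inspected in W&B, as shown in Figure 1: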

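[Figure 1: the samples.json file under the "Files" tab of the W&B dashboard, showing math problems and solutions produced by the SDG pipeline.] These samples give a direct view of the quality of the synthetic data, help you judge quickly whether the generation strategy is working, and inform the training that follows.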
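The num_chunks option mentioned above splits the input so that several jobs can each process one part. How NeMo-Skills divides the file is not described in this article; the sketch below shows one common scheme, contiguous near-equal chunks, with a hypothetical function name.

```python
def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous, near-equal parts."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(items), num_chunks)
    chunks = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


problems = [f"problem-{i}" for i in range(1000)]
chunks = split_into_chunks(problems, 3)
print([len(c) for c in chunks])  # [334, 333, 333]
```

With the 1000-line data.jsonl prepared earlier, three chunks would hold 334, 333, and 333 problems, and concatenating the per-chunk outputs in order restores the original ordering.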

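Training with NeMo

Once the synthetic data is ready, you can fine-tune the model. The sections below cover two training backends, NeMo-Aligner and NeMo-RL.

Preparing the training data

First convert the generated solutions into the format NeMo expects: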

ns run_cmd --log_dir=/workspace/prepare-sft-data --expname=prepare-sft-data --run_after=solution-generation --cluster=local \
  'python -m nemo_skills.training.prepare_data \
   ++input_files=/workspace/sdg/solutions/output.jsonl \
   ++output_path=/workspace/sft-data.jsonl \
   ++prompt_config=generic/math \
   ++prompt_template=qwen-instruct \
   ++filters.remove_contaminated=false \
   ++add_unlabeled=true \
   ++filters.remove_no_think_tags=true \
   ++filters.trim_solutions=false'
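Judging by its name, the filters.remove_no_think_tags option presumably drops generations that never close their reasoning block, for example because they hit the 8192-token limit before finishing. The sketch below illustrates that idea only; it is not the library's implementation, and the generation field name and the </think> tag are assumptions about the output format.

```python
import json


def filter_think_tags(jsonl_lines, field="generation", tag="</think>"):
    """Keep only records whose `field` contains the closing think tag."""
    kept = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if tag in record.get(field, ""):
            kept.append(record)
    return kept


# Made-up records: the second one was cut off mid-reasoning.
lines = [
    '{"problem": "1+1?", "generation": "<think>add</think> The answer is 2."}',
    '{"problem": "2+2?", "generation": "<think>this one was cut off"}',
]
kept = filter_think_tags(lines)
print(len(kept), kept[0]["problem"])
```

Filtering this way keeps truncated reasoning traces out of the SFT data, so the model is not trained to stop mid-thought.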

Converting the model format

Convert the base model to NeMo format (skip this step if you train with NeMo-RL):

ns convert \
  --cluster=local \
  --expname=convert-14b-nemo \
  --run_after=download-14b \
  --input_model=/workspace/Qwen2.5-14B-Instruct \
  --output_model=/workspace/qwen2.5-14b-instruct-nemo \
  --convert_from=hf \
  --convert_to=nemo \
  --num_gpus=8 \
  --model_type=qwen \
  --hf_model_name=Qwen/Qwen2.5-14B-Instruct

Running supervised fine-tuning (SFT)

Training command for the NeMo-Aligner backend:

ns train \
  --cluster=local \
  --expname=training \
  --run_after=convert-14b-nemo \
  --run_after=prepare-sft-data \
  --output_dir=/workspace/training \
  --nemo_model=/workspace/qwen2.5-14b-instruct-nemo \
  --num_nodes=1 \
  --num_gpus=8 \
  --training_data=/workspace/sft-data.jsonl \
  ++model.data.train_ds.max_seq_length=8192 \
  ++model.data.train_ds.global_batch_size=32 \
  ++model.tensor_model_parallel_size=4 \
  ++model.context_parallel_size=2 \
  ++model.optim.lr=1e-5 \
  ++trainer.sft.max_epochs=2
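The parallelism settings in the command above have to fit the available GPUs. As a worked check, the sketch below applies the standard Megatron-style relation (world size = tensor parallel x context parallel x pipeline parallel x data parallel). Pipeline parallelism and the micro batch size are not set in the command, so the values of 1 used here are assumptions.

```python
def data_parallel_size(num_nodes, gpus_per_node, tensor_parallel,
                       context_parallel, pipeline_parallel=1):
    """GPUs left for data parallelism once model-parallel groups are formed."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * context_parallel * pipeline_parallel
    if world_size % model_parallel:
        raise ValueError(
            f"{world_size} GPUs cannot be split into groups of {model_parallel}"
        )
    return world_size // model_parallel


def accumulation_steps(global_batch_size, micro_batch_size, dp_size):
    """Micro-batches each data-parallel rank processes per optimizer step."""
    per_step = micro_batch_size * dp_size
    if global_batch_size % per_step:
        raise ValueError("global batch size must be divisible by micro batch x DP")
    return global_batch_size // per_step


# Values from the command: 1 node x 8 GPUs, TP=4, CP=2, global batch 32.
dp = data_parallel_size(num_nodes=1, gpus_per_node=8,
                        tensor_parallel=4, context_parallel=2)
# micro_batch_size=1 is an assumption; the command does not set it.
steps = accumulation_steps(global_batch_size=32, micro_batch_size=1, dp_size=dp)
print(dp, steps)  # 1 32
```

With tensor parallel 4 and context parallel 2, all 8 GPUs hold a single model replica, so the global batch of 32 is reached through gradient accumulation rather than through data parallelism.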

Training command for the NeMo-RL backend (no NeMo format conversion needed):

ns nemo_rl sft \
  --cluster=local \
  --expname=training \
  --run_after=download-14b \
  --run_after=prepare-sft-data \
  --output_dir=/workspace/training \
  --hf_model=/workspace/Qwen2.5-14B-Instruct \
  --num_nodes=1 \
  --num_gpus=8 \
  --training_data=/workspace/sft-data.jsonl \
  --cache_dir=/workspace/nemo-rl-cache \
  --final_hf_path=/workspace/training/qwen2.5-14b-improved-hf \
  ++sft.max_num_epochs=4 \
  ++policy.dtensor_cfg.tensor_parallel_size=8 \
  ++policy.max_total_sequence_length=8192 \
  ++policy.train_global_batch_size=32 \
  ++policy.optimizer.kwargs.lr=1e-5 \
  ++policy.dtensor_cfg.sequence_parallel=true \
  ++policy.dtensor_cfg.activation_checkpointing=true

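Key metrics such as the loss curve and the learning-rate schedule can be monitored in real time in W&B, which helps you adjust the training setup promptly.

Evaluating the fine-tuned model

After fine-tuning, convert the model back to Hugging Face format (if you trained with NeMo-RL, skip this step and use the output at the path given by final_hf_path directly):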

ns convert \
  --cluster=local \
  --expname=convert-14b-hf \
  --run_after=training \
  --input_model=/workspace/training/model-averaged-nemo \
  --output_model=/workspace/training/qwen2.5-14b-improved-hf \
  --convert_from=nemo \
  --convert_to=hf \
  --num_gpus=8 \
  --model_type=qwen \
  --hf_model_name=Qwen/Qwen2.5-14B-Instruct

Run the final evaluation:

ns eval \
  --cluster=local \
  --expname=final-eval \
  --run_after=convert-14b-hf \
  --model=/workspace/training/qwen2.5-14b-improved-hf \
  --server_type=vllm \
  --server_gpus=8 \
  --benchmarks=aime24:8,aime25:8 \
  --output_dir=/workspace/evals/after-training \
  ++inference.tokens_to_generate=16384

ns summarize_results --cluster=local /workspace/evals/after-training --wandb_name=after-training-evals

Results after fine-tuning (example):

--------------------------------- aime24 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 13362      | 27.92%           | 55.83%
majority@8      | 30          | 13362      | 40.00%           | 16.67%
pass@8          | 30          | 13362      | 50.00%           | 16.67%

--------------------------------- aime25 --------------------------------
evaluation_mode | num_entries | avg_tokens | symbolic_correct | no_answer
pass@1[8]       | 30          | 13445      | 17.92%           | 53.33%
majority@8      | 30          | 13445      | 26.67%           | 10.00%
pass@8          | 30          | 13445      | 36.67%           | 10.00%

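The results show that the fine-tuned model's pass@8 on AIME24 rises from 33.33% to 50.00%, and its pass@1[8] from 11.67% to 27.92%, a clear improvement in mathematical reasoning.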
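To compare the two runs across every metric, the gains can be computed directly from the example tables. The sketch below copies the symbolic_correct values from the baseline and post-training results shown in this article; the helper name is hypothetical.

```python
# symbolic_correct values (percent) copied from the two result tables.
baseline = {
    "aime24": {"pass@1[8]": 11.67, "majority@8": 13.33, "pass@8": 33.33},
    "aime25": {"pass@1[8]": 11.67, "majority@8": 20.00, "pass@8": 26.67},
}
after = {
    "aime24": {"pass@1[8]": 27.92, "majority@8": 40.00, "pass@8": 50.00},
    "aime25": {"pass@1[8]": 17.92, "majority@8": 26.67, "pass@8": 36.67},
}


def deltas(before, after):
    """Gain in percentage points for each benchmark and metric."""
    return {
        bench: {
            metric: round(after[bench][metric] - before[bench][metric], 2)
            for metric in before[bench]
        }
        for bench in before
    }


gains = deltas(baseline, after)
for bench, metrics in gains.items():
    for metric, gain in metrics.items():
        print(f"{bench} {metric}: {gain:+.2f} points")
```

In these example numbers every metric improves on both benchmarks, with the largest gain in majority@8 on AIME24 (26.67 points).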