Hands-On Large Models: RLHF
Summary: This article introduces RLHF (Reinforcement Learning from Human Feedback) training of large models with the PPO (Proximal Policy Optimization) algorithm. It walks through an experiment that fine-tunes GPT-2 to generate positive movie reviews: 1) a BERT sentiment classifier serves as the reward function; 2) the beginnings of reviews from the IMDB dataset are used as prompts for the continuations; 3) the classifier output is used as the reward signal for PPO training.
How PPO Works
- Rollout: the language model generates a response from the query.
- Evaluation: the query and response are evaluated with a function, a model, human feedback, or some combination of them. The process should yield one scalar value for each query/response pair.
- Optimization: in the optimization step, the query/response pairs are used to compute the log-probabilities of the tokens in the sequences. This is done with both the trained model and a reference model. The KL divergence between the two outputs is used as an additional reward signal to make sure the generated responses do not stray too far from the reference language model. The active language model is then trained with PPO (a code sketch of this loop follows below).
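These three steps map directly onto the TRL API used throughout this post. Below is a minimal sketch of one PPO iteration; it assumes a ppo_trainer and generation_kwargs as set up later in this post, and reward_fn is a hypothetical placeholder for whatever scoring function you use.
# Sketch of one PPO iteration with trl's PPOTrainer (illustrative only)
for batch in ppo_trainer.dataloader:
    queries = batch["input_ids"]
    # 1. Rollout: the policy model generates a response for each query
    responses = [
        ppo_trainer.generate(q, **generation_kwargs).squeeze()[len(q):] for q in queries
    ]
    # 2. Evaluation: score every query/response pair with a single scalar
    #    (in this post: decode the texts and run a sentiment classifier)
    rewards = [torch.tensor(reward_fn(q, r)) for q, r in zip(queries, responses)]
    # 3. Optimization: PPO step; the KL penalty against the reference model is added internally
    stats = ppo_trainer.step(queries, responses, rewards)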
Fine-Tune GPT-2 to Generate Positive Reviews
Optimize GPT-2 to produce positive IMDB movie reviews, using a BERT sentiment classifier as the reward function.
We fine-tune GPT-2 to generate positive movie reviews based on the IMDB dataset. The model is fed the beginning of a real review and has to produce a positive continuation. To reward positive continuations, we use a BERT classifier to analyse the sentiment of the generated sentences and use the classifier's output as the reward signal for PPO training.
Experiment Setup
Download Models and Data
Dataset
export HF_ENDPOINT=https://hf-mirror.com; huggingface-cli download --resume-download stanfordnlp/imdb --local-dir dataset/imdb --repo-type dataset
Reference Model
export HF_ENDPOINT=https://hf-mirror.com; huggingface-cli download --resume-download lvwerra/gpt2-imdb --local-dir model/gpt2-imdb
Reward Model
export HF_ENDPOINT=https://hf-mirror.com; huggingface-cli download --resume-download lvwerra/distilbert-imdb --local-dir model/distilbert-imdb
Import Dependencies
# %pip install -r requirements.txt
# import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '7'
import torch
from tqdm import tqdm
import pandas as pd
tqdm.pandas()
from transformers import pipeline, AutoTokenizer
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
Configuration
config = PPOConfig(
model_name="model/gpt2-imdb",
learning_rate=1.41e-5,
log_with="wandb",
)
sent_kwargs = {"top_k": None, "function_to_apply": "none", "batch_size": 16}
import wandb
wandb.init()
You can see that we load a GPT-2 model called gpt2-imdb. The model was additionally fine-tuned on the IMDB dataset for 1 epoch using a Hugging Face script (with no special settings). The remaining parameters are mostly taken from the original paper "Fine-Tuning Language Models from Human Preferences". Both this model and the BERT classifier (lvwerra/gpt2-imdb and lvwerra/distilbert-imdb) are available on the Hugging Face model hub.
Load Data and Models
Load the IMDB Dataset
The IMDB dataset contains 50,000 movie reviews annotated with "positive"/"negative" sentiment labels. We load the IMDB dataset into a DataFrame and filter for reviews that are at least 200 characters long. Then we tokenize each text and randomly truncate it to a given length with the LengthSampler.
def build_dataset(
config,
dataset_name="dataset/imdb",
input_min_text_length=2,
input_max_text_length=8,
):
"""
Build dataset for training. This builds the dataset from `load_dataset`, one should
customize this function to train the model on its own dataset.
Args:
dataset_name (`str`):
The name of the dataset to be loaded.
Returns:
dataloader (`torch.utils.data.DataLoader`):
The dataloader for the dataset.
"""
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
# load imdb with datasets
ds = load_dataset(dataset_name, split="train")
ds = ds.rename_columns({"text": "review"})
ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)
input_size = LengthSampler(input_min_text_length, input_max_text_length)
def tokenize(sample):
sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
sample["query"] = tokenizer.decode(sample["input_ids"])
return sample
ds = ds.map(tokenize, batched=False)
ds.set_format(type="torch")
return ds
dataset = build_dataset(config)
def collator(data):
return dict((key, [d[key] for d in data]) for key in data[0])
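The collator simply transposes a list of sample dicts into a dict of lists, which is the batch layout the PPOTrainer expects. For example (toy values, dummy token ids):
# two samples in, one batch dict out
batch = collator([
    {"query": "This movie", "input_ids": torch.tensor([101, 102])},  # dummy ids for illustration
    {"query": "I really",   "input_ids": torch.tensor([103, 104])},
])
# batch == {"query": ["This movie", "I really"],
#           "input_ids": [tensor([101, 102]), tensor([103, 104])]}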
Load the Pretrained GPT-2 Language Model
We load the GPT-2 model with a value head and the tokenizer. The model is loaded twice: the first instance is the one we optimize, while the second serves as a reference for computing the KL divergence from the starting point. This acts as an additional reward signal during PPO training and makes sure the optimized model does not deviate too far from the original language model.
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token
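As a reference for how the KL term mentioned above enters the optimization: conceptually, TRL turns it into a per-token penalty that is combined with the sentiment score. This is only a simplified sketch of the idea, not TRL's exact implementation; logprobs, ref_logprobs, score and kl_coef are stand-ins for quantities the trainer computes internally.
# simplified view of the PPO reward shaping (per generated token)
kl = logprobs - ref_logprobs          # per-token KL estimate between policy and reference model
per_token_reward = -kl_coef * kl      # penalize drifting away from the reference model
per_token_reward[-1] += score         # the scalar classifier score is added at the last token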
Initialize the PPOTrainer
The PPOTrainer takes care of device placement and optimization from here on:
ppo_trainer = PPOTrainer(
config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator
)
Load the BERT Classifier
We load a BERT classifier that was fine-tuned on the IMDB dataset.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
device = 0 if torch.cuda.is_available() else "cpu" # to avoid a `pipeline` bug
sentiment_pipe = pipeline(
"sentiment-analysis", model="model/distilbert-imdb", device=device
)
Device set to use cuda:0
The model outputs the logits for the negative and the positive class. We will use the logits of the positive class as the reward signal for the language model.
text = "this movie was really bad!!"
sentiment_pipe(text, **sent_kwargs)
[{'label': 'NEGATIVE', 'score': 2.3350484371185303},
{'label': 'POSITIVE', 'score': -2.726576089859009}]
text = "this movie was really good!!"
sentiment_pipe(text, **sent_kwargs)
[{'label': 'POSITIVE', 'score': 2.557040214538574},
{'label': 'NEGATIVE', 'score': -2.294790267944336}]
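The training loop below extracts the POSITIVE logit from the pipeline output with a list comprehension. The same logic as a small standalone helper (hypothetical, just for illustration):
def positive_logit(pipe_output):
    # pipe_output looks like [{"label": "NEGATIVE", "score": ...}, {"label": "POSITIVE", "score": ...}]
    return next(item["score"] for item in pipe_output if item["label"] == "POSITIVE")

positive_logit(sentiment_pipe("this movie was really good!!", **sent_kwargs))  # roughly 2.56, as above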
Generation Settings
For response generation we just use sampling, making sure that top-k and nucleus sampling are turned off, and set a minimum length.
gen_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
}
Optimize the Model
Training Loop
The training loop consists of the following main steps:
- Get query responses from the policy network (GPT-2)
- Get sentiment scores for the query/response pairs from BERT
- Optimize the policy with PPO using the (query, response, reward) triplets
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)
generation_kwargs = {
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
"pad_token_id": tokenizer.eos_token_id,
}
for epoch, batch in enumerate(tqdm(ppo_trainer.dataloader)):
query_tensors = batch["input_ids"]
#### Get response from gpt2
response_tensors = []
for query in query_tensors:
gen_len = output_length_sampler()
generation_kwargs["max_new_tokens"] = gen_len
query_response = ppo_trainer.generate(query, **generation_kwargs).squeeze()
response_len = len(query_response) - len(query)
response_tensors.append(query_response[-response_len:])
batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]
#### Compute sentiment score
texts = [q + r for q, r in zip(batch["query"], batch["response"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
item["score"]
for output in pipe_outputs
for item in output
if item["label"] == "POSITIVE"
]
rewards = [torch.tensor(score) for score in positive_scores]
#### Run PPO step
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
0%| | 0/194 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
4%|▍ | 8/194 [01:23<32:18, 10.42s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 194/194 [35:19<00:00, 10.92s/it]
Training Progress
If you are tracking the training progress with Weights & Biases, you should see curves similar to the figure below; an interactive example report is available on wandb.ai.
One can observe that after a few optimization steps the model starts to generate more positive outputs.
Model Inspection
Let's inspect some examples from the IMDB dataset. We can use ref_model to compare the optimized model (model) with the model before optimization.
#### get a batch from the dataset
bs = 16
game_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
game_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()
response_tensors_ref, response_tensors = [], []
#### get response from gpt2 and gpt2_ref
for i in range(bs):
query = torch.tensor(query_tensors[i]).to(device)
gen_len = output_length_sampler()
query_response = ref_model.generate(
query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs
).squeeze()
response_len = len(query_response) - len(query)
response_tensors_ref.append(query_response[-response_len:])
query_response = model.generate(
query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs
).squeeze()
response_len = len(query_response) - len(query)
response_tensors.append(query_response[-response_len:])
#### decode responses
game_data["response (before)"] = [
tokenizer.decode(response_tensors_ref[i]) for i in range(bs)
]
game_data["response (after)"] = [
tokenizer.decode(response_tensors[i]) for i in range(bs)
]
#### sentiment analysis of query/response pairs before/after
texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
item["score"]
for output in pipe_outputs
for item in output
if item["label"] == "POSITIVE"
]
game_data["rewards (before)"] = positive_scores
texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
item["score"]
for output in pipe_outputs
for item in output
if item["label"] == "POSITIVE"
]
game_data["rewards (after)"] = positive_scores
# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results
 | query | response (before) | response (after) | rewards (before) | rewards (after) |
---|---|---|---|---|---|
0 | Well I guess I know | that Cantor may be an | ..but I loved it | 0.230196 | 2.281557 |
1 | This is an excellent, | direct-to-video film with typical | enjoyable movie.<|endoftext|> | 2.846593 | 2.840860 |
2 | Now, I | 've never had the chance with James | loved the growing episode - and the | 0.656194 | 2.525894 |
3 | We tend | not to see Arthur | to like this very | -0.280880 | 2.183822 |
4 | The proverb "Never judge a book | by the cover" has caught on. After glancing t... | with high compliments, but it is recommended ... | 0.274649 | 2.065951 |
5 | I've never understood | why so many artsmen, | this film but it's delightful | 0.835574 | 2.782384 |
6 | Hugh (Ed Harris) is | an acclaimed "hero" and his fian | a wonderful actor who is a good adaptation | 1.580167 | 2.602940 |
7 | This particular Joe McDoakes | ' episode brought all the wrong bits and | movie is really a great movie. It | 0.870956 | 2.795245 |
8 | Sisters In | Vrooms 8.23, I signed up for all of the | The Universe 1: Sunny is cute, and has a cute... | 1.175259 | 2.062330 |
9 | I was very fond of this | film, it was obviously a bad idea when first ... | show, and know that I have seen it several times | 1.058164 | 2.511273 |
10 | If he wanted to be | funny, he could | a genius eventually, | -0.388943 | 0.405888 |
11 | Thats My | Grade...<br /><br />Although | Way was the best movie that I watched. | -0.151680 | 2.473050 |
12 | This is possibly the best short | film I have come across in almost two years. | film ever written. It has some very memorable... | 2.511835 | 2.775994 |
13 | Some people say this is | exactly what happens in Hollywood; where come... | a powerful film to listen to. It really captures | 0.637631 | 2.821085 |
14 | A remake of | "The Wizard of Oz | the legendary Kingan oil | 0.292409 | 0.434021 |
15 | What a terrible | movie!<|endoftext|> | chopping sounded so good, I love it! We have a | -2.681461 | 2.340650 |
Looking at the reward mean/median of the generated sequences, we observe a significant difference.
print("mean:")
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print("median:")
display(df_results[["rewards (before)", "rewards (after)"]].median())
mean:
rewards (before) 0.591666
rewards (after) 2.243934
dtype: float64
median:
rewards (before) 0.646912
rewards (after) 2.492161
dtype: float64
Save the Model
Finally, we save the model for later use.
model.save_pretrained("model/gpt2-imdb-pos-v2")
tokenizer.save_pretrained("model/gpt2-imdb-pos-v2")
('model/gpt2-imdb-pos-v2/tokenizer_config.json',
'model/gpt2-imdb-pos-v2/special_tokens_map.json',
'model/gpt2-imdb-pos-v2/vocab.json',
'model/gpt2-imdb-pos-v2/merges.txt',
'model/gpt2-imdb-pos-v2/added_tokens.json',
'model/gpt2-imdb-pos-v2/tokenizer.json')
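To use the fine-tuned policy later, it can be reloaded like any other causal LM. A minimal sketch (the value head is only needed for further PPO training, so a plain AutoModelForCausalLM is enough for generation; loading it this way simply ignores the saved value-head weights):
from transformers import AutoModelForCausalLM, AutoTokenizer

# reload the PPO-tuned policy for inference
tuned_model = AutoModelForCausalLM.from_pretrained("model/gpt2-imdb-pos-v2")
tuned_tokenizer = AutoTokenizer.from_pretrained("model/gpt2-imdb-pos-v2")

inputs = tuned_tokenizer("This movie was", return_tensors="pt")
output_ids = tuned_model.generate(
    **inputs, max_new_tokens=16, do_sample=True, top_k=0, top_p=1.0,
    pad_token_id=tuned_tokenizer.eos_token_id,
)
print(tuned_tokenizer.decode(output_ids[0]))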