Hands-On Large Models: RLHF

How PPO Works

  1. Rollout: the language model generates a response to a query.
  2. Evaluation: the query and response are evaluated with a function, a model, human feedback, or some combination of these. The process should yield a single scalar value for each query/response pair.
  3. Optimization: the query/response pairs are used to compute the log-probabilities of the tokens in the sequences, using both the trained model and a reference model. The KL divergence between the two outputs serves as an additional reward signal, ensuring that the generated responses do not drift too far from the reference language model. The active language model is then trained with PPO; see the sketch after this list.
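
To make these three phases concrete, here is a minimal sketch of the loop using the same TRL PPOTrainer API that the rest of this tutorial uses; reward_fn is a hypothetical placeholder for whatever evaluator you choose:

# Schematic PPO loop; `reward_fn` is a hypothetical evaluator.
for batch in ppo_trainer.dataloader:
    queries = batch["input_ids"]

    # 1. Rollout: the policy LM generates a response for each query.
    responses = [ppo_trainer.generate(q)[0, len(q):] for q in queries]

    # 2. Evaluation: one scalar reward per query/response pair.
    rewards = [reward_fn(q, r) for q, r in zip(queries, responses)]

    # 3. Optimization: PPO step with a KL penalty against the reference model.
    stats = ppo_trainer.step(queries, responses, rewards)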

Fine-Tuning GPT-2 to Generate Positive Reviews

We optimize GPT-2 to generate positive IMDB movie reviews, using a BERT sentiment classifier as the reward function.

[Figure: experiment overview — GPT-2 continues IMDB review prefixes, and a BERT sentiment classifier scores the continuations]

We fine-tune GPT-2 to generate positive movie reviews based on the IMDB dataset. The model receives the beginning of a real review and must generate a positive continuation. To reward positive continuations, we use a BERT classifier to score the sentiment of the generated sentences and use the classifier's output as the reward signal for PPO training.

Experimental Setup

Download the Models and Data

Dataset

export HF_ENDPOINT=https://hf-mirror.com; huggingface-cli download --resume-download stanfordnlp/imdb --local-dir dataset/imdb --repo-type dataset

Reference Model

export HF_ENDPOINT=https://hf-mirror.com; huggingface-cli download --resume-download lvwerra/gpt2-imdb --local-dir model/gpt2-imdb

Reward Model

export HF_ENDPOINT=https://hf-mirror.com; huggingface-cli download --resume-download lvwerra/distilbert-imdb --local-dir model/distilbert-imdb

Import Dependencies

# %pip install -r requirements.txt
# import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '7'
import torch
from tqdm import tqdm
import pandas as pd

tqdm.pandas()

from transformers import pipeline, AutoTokenizer
from datasets import load_dataset

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler

Configuration

config = PPOConfig(
    model_name="model/gpt2-imdb",
    learning_rate=1.41e-5,
    log_with="wandb",
)

sent_kwargs = {"top_k": None, "function_to_apply": "none", "batch_size": 16}

import wandb

wandb.init()

You can see that we load a GPT-2 model called gpt2-imdb. This model was fine-tuned for one additional epoch on the IMDB dataset using Hugging Face's training script (with no special settings). The remaining parameters are mostly taken from the original paper "Fine-Tuning Language Models from Human Preferences". Both this model and the BERT classifier are available on the Hugging Face Hub under the repository names used in the download commands above.

Load Data and Models

Load the IMDB Dataset

The IMDB dataset contains 50,000 movie reviews labeled with positive/negative sentiment. We load the IMDB dataset and keep only reviews longer than 200 characters. We then tokenize each text and randomly truncate it to a length drawn by LengthSampler.

def build_dataset(
    config,
    dataset_name="dataset/imdb",
    input_min_text_length=2,
    input_max_text_length=8,
):
    """
    Build dataset for training. This builds the dataset from `load_dataset`, one should
    customize this function to train the model on its own dataset.

    Args:
        dataset_name (`str`):
            The name of the dataset to be loaded.

    Returns:
        dataloader (`torch.utils.data.DataLoader`):
            The dataloader for the dataset.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds


dataset = build_dataset(config)


def collator(data):
    # Collate a list of sample dicts into a dict of lists (no padding needed).
    return {key: [d[key] for d in data] for key in data[0]}
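
For example, two samples collate into one dict of lists:

collator([
    {"input_ids": [1, 2], "query": "a"},
    {"input_ids": [3], "query": "b"},
])
# {'input_ids': [[1, 2], [3]], 'query': ['a', 'b']}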

Load the Pretrained GPT-2 Language Model

We load the GPT-2 model with a value head, along with the tokenizer. We load the model twice: the first copy is the one we optimize, while the second serves as a reference for computing the KL divergence from the starting point. This KL term acts as an additional reward signal during PPO training, ensuring the optimized model does not deviate too far from the original language model.

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

tokenizer.pad_token = tokenizer.eos_token
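
TRL computes this KL shaping internally inside each PPO step; as a rough sketch of the idea, assuming per-token log-probability tensors from both models are already available (the coefficient mirrors TRL's init_kl_coef default of 0.2):

def kl_shaped_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    # Per-token KL penalty between the trained policy and the reference model.
    kl = logprobs - ref_logprobs
    shaped = -kl_coef * kl
    # The scalar sentiment score is added on the last response token only.
    shaped[-1] += score
    return shaped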

Initialize the PPOTrainer

The PPOTrainer handles device placement and optimization from here on:

ppo_trainer = PPOTrainer(
    config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator
)

Load the BERT Classifier

We load a DistilBERT classifier that was fine-tuned on the IMDB dataset.

device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
sentiment_pipe = pipeline(
    "sentiment-analysis", model="model/distilbert-imdb", device=device
)
Device set to use cuda:0

The model outputs logits for the negative and positive classes. We will use the positive-class logit as the reward signal for the language model.

text = "this movie was really bad!!"
sentiment_pipe(text, **sent_kwargs)
[{'label': 'NEGATIVE', 'score': 2.3350484371185303},
 {'label': 'POSITIVE', 'score': -2.726576089859009}]
text = "this movie was really good!!"
sentiment_pipe(text, **sent_kwargs)
[{'label': 'POSITIVE', 'score': 2.557040214538574},
 {'label': 'NEGATIVE', 'score': -2.294790267944336}]
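
The training loop below pulls the POSITIVE logit out of such outputs; a small helper (hypothetical, for illustration) that does the same thing:

def positive_logit(pipe_output):
    # `pipe_output` is the list of {"label", "score"} dicts for one text.
    return next(x["score"] for x in pipe_output if x["label"] == "POSITIVE")

positive_logit(sentiment_pipe("this movie was really good!!", **sent_kwargs))
# 2.557040214538574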

Generation Settings

For response generation we use plain sampling, making sure top-k and nucleus sampling are turned off (top_k=0.0, top_p=1.0), and disabling the minimum length as well (min_length=-1).

gen_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

Optimize the Model

Training Loop

The training loop consists of the following main steps:

  1. Get query responses from the policy network (GPT-2)
  2. Get sentiment scores for the query/response pairs from BERT
  3. Optimize the policy with PPO using the (query, response, reward) triplets

output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)
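
As an aside, LengthSampler simply draws a fresh random length from the configured range on every call, which is why the generated responses vary in length:

print([output_length_sampler() for _ in range(8)])
# e.g. [7, 12, 4, 9, 15, 6, 11, 13]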


generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}


for epoch, batch in enumerate(tqdm(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    #### Get response from gpt2
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        query_response = ppo_trainer.generate(query, **generation_kwargs).squeeze()
        response_len = len(query_response) - len(query)
        response_tensors.append(query_response[-response_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    #### Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    positive_scores = [
        item["score"]
        for output in pipe_outputs
        for item in output
        if item["label"] == "POSITIVE"
    ]
    rewards = [torch.tensor(score) for score in positive_scores]

    #### Run PPO step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
  0%|          | 0/194 [00:00<?, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
  4%|▍         | 8/194 [01:23<32:18, 10.42s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 194/194 [35:19<00:00, 10.92s/it]
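
(Both warnings are expected here: the pad token was set to the EOS token, and queries are passed to generate one at a time, so the missing attention mask should be harmless; the pipeline efficiency hint can likewise be ignored at this scale.)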

Training Progress

If you are tracking training progress with Weights & Biases, you should see curves similar to the figure below. Interactive example reports are available on wandb.ai.

[Figure: Weights & Biases training curves]

We can see that after a few optimization steps the model starts to generate more positive outputs.

Model Inspection

Let's inspect some examples from the IMDB dataset. We can use ref_model to compare the optimized model (model) against the model before optimization.

#### get a batch from the dataset
bs = 16
game_data = dict()
dataset.set_format("pandas")
df_batch = dataset[:].sample(bs)
game_data["query"] = df_batch["query"].tolist()
query_tensors = df_batch["input_ids"].tolist()

response_tensors_ref, response_tensors = [], []

#### get response from gpt2 and gpt2_ref
for i in range(bs):
    query = torch.tensor(query_tensors[i]).to(device)

    gen_len = output_length_sampler()
    query_response = ref_model.generate(
        query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()
    response_len = len(query_response) - len(query)
    response_tensors_ref.append(query_response[-response_len:])

    query_response = model.generate(
        query.unsqueeze(0), max_new_tokens=gen_len, **gen_kwargs
    ).squeeze()
    response_len = len(query_response) - len(query)
    response_tensors.append(query_response[-response_len:])

#### decode responses
game_data["response (before)"] = [
    tokenizer.decode(response_tensors_ref[i]) for i in range(bs)
]
game_data["response (after)"] = [
    tokenizer.decode(response_tensors[i]) for i in range(bs)
]

#### sentiment analysis of query/response pairs before/after
texts = [q + r for q, r in zip(game_data["query"], game_data["response (before)"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
    item["score"]
    for output in pipe_outputs
    for item in output
    if item["label"] == "POSITIVE"
]
game_data["rewards (before)"] = positive_scores

texts = [q + r for q, r in zip(game_data["query"], game_data["response (after)"])]
pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
positive_scores = [
    item["score"]
    for output in pipe_outputs
    for item in output
    if item["label"] == "POSITIVE"
]
game_data["rewards (after)"] = positive_scores

# store results in a dataframe
df_results = pd.DataFrame(game_data)
df_results
|    | query | response (before) | response (after) | rewards (before) | rewards (after) |
|----|-------|-------------------|------------------|------------------|-----------------|
| 0  | Well I guess I know | that Cantor may be an | ..but I loved it | 0.230196 | 2.281557 |
| 1  | This is an excellent, | direct-to-video film with typical | enjoyable movie.<\|endoftext\|> | 2.846593 | 2.840860 |
| 2  | Now, I | 've never had the chance with James | loved the growing episode - and the | 0.656194 | 2.525894 |
| 3  | We tend | not to see Arthur | to like this very | -0.280880 | 2.183822 |
| 4  | The proverb "Never judge a book by | the cover" has caught on. After glancing t... | with high compliments, but it is recommended ... | 0.274649 | 2.065951 |
| 5  | I've never understood why so many arts | men, this film | but it's delightful | 0.835574 | 2.782384 |
| 6  | Hugh (Ed Harris) is an | acclaimed "hero" and his fian | a wonderful actor who is a good adaptation | 1.580167 | 2.602940 |
| 7  | This particular Joe McDoakes | ' episode brought all the wrong bits and | movie is really a great movie. It | 0.870956 | 2.795245 |
| 8  | Sisters In | Vrooms 8.23, I signed up for all of the | The Universe 1: Sunny is cute, and has a cute... | 1.175259 | 2.062330 |
| 9  | I was very fond | of this film, it was obviously a bad idea when first ... | show, and know that I have seen it several times | 1.058164 | 2.511273 |
| 10 | If he wanted | to be funny, he could | a genius eventually, | -0.388943 | 0.405888 |
| 11 | Thats My | Grade...<br /><br />Although Way | was the best movie that I watched. | -0.151680 | 2.473050 |
| 12 | This is possibly the best | short film I have come across in almost two years. | film ever written. It has some very memorable... | 2.511835 | 2.775994 |
| 13 | Some people say this is | exactly what happens in Hollywood; where come... | a powerful film to listen to. It really captures | 0.637631 | 2.821085 |
| 14 | A remake of "The | Wizard of Oz | the legendary Kingan oil | 0.292409 | 0.434021 |
| 15 | What a | terrible movie!<\|endoftext\|> | chopping sounded so good, I love it! We have a | -2.681461 | 2.340650 |

Looking at the mean/median reward of the generated sequences, we see a significant difference between the models before and after optimization.

print("mean:")
display(df_results[["rewards (before)", "rewards (after)"]].mean())
print()
print("median:")
display(df_results[["rewards (before)", "rewards (after)"]].median())
mean:

rewards (before)    0.591666
rewards (after)     2.243934
dtype: float64


median:

rewards (before)    0.646912
rewards (after)     2.492161
dtype: float64

Save the Model

Finally, we save the model for later use.

model.save_pretrained("model/gpt2-imdb-pos-v2")
tokenizer.save_pretrained("model/gpt2-imdb-pos-v2")
('model/gpt2-imdb-pos-v2/tokenizer_config.json',
 'model/gpt2-imdb-pos-v2/special_tokens_map.json',
 'model/gpt2-imdb-pos-v2/vocab.json',
 'model/gpt2-imdb-pos-v2/merges.txt',
 'model/gpt2-imdb-pos-v2/added_tokens.json',
 'model/gpt2-imdb-pos-v2/tokenizer.json')
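
To use the tuned model later, reload it like any other checkpoint; a minimal sketch (the prompt and generation settings are illustrative):

from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

# Reload the tuned model and tokenizer from disk.
model = AutoModelForCausalLMWithValueHead.from_pretrained("model/gpt2-imdb-pos-v2")
tokenizer = AutoTokenizer.from_pretrained("model/gpt2-imdb-pos-v2")

# Generate a (hopefully positive) continuation for a review prefix.
inputs = tokenizer("This movie was", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=16, do_sample=True, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0]))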