Chapter 8 - Reinforcement Learning with PyTorch: Hands-On Lab Manual

1. Fundamentals of Reinforcement Learning

Objective: Understand the core concepts of reinforcement learning and how it differs from supervised learning.

Content:

a. What is reinforcement learning?

Reinforcement learning is a subfield of machine learning in which an agent learns to take actions in an environment so as to maximize some notion of cumulative reward.

b. Terminology:

  • Agent: the entity that takes actions in the environment.
  • Environment: the external world the agent interacts with.
  • Reward: the feedback the agent receives from the environment after each action.
  • Policy: the strategy the agent uses to decide which action to take.

The interaction loop sketched below shows how these pieces fit together.
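To make these terms concrete, here is a minimal sketch of the agent-environment loop. It assumes the classic Gym API (gym earlier than 0.26) and uses random action selection as a stand-in for a learned policy; it is illustrative only.

import gym

# Minimal agent-environment loop (classic Gym API assumed; the "policy" here is
# just random sampling, standing in for a learned policy).
env = gym.make('CartPole-v1')
state = env.reset()                     # the environment provides the initial state
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()  # the agent chooses an action
    next_state, reward, done, _ = env.step(action)  # the environment responds
    total_reward += reward              # accumulate the reward signal
    state = next_state
env.close()
print(f"Episode return: {total_reward}")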

2. Q-learning and Deep Q-Networks (DQN)

Objective: Understand the Q-learning algorithm and its deep learning extension, DQN.

Content:

a. Q-learning:

Q-learning is a model-free reinforcement learning algorithm for finding an optimal action-selection policy. It keeps an estimate Q(s, a) of the value of each state-action pair and repeatedly nudges it toward the bootstrapped target r + gamma * max over a' of Q(s', a'), as in the tabular sketch below.
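A minimal tabular Q-learning sketch follows. It assumes a discrete environment (Gym's FrozenLake-v1 is used purely for illustration) and the classic Gym API; the values of alpha, gamma, and epsilon are illustrative, not tuned.

import gym
import numpy as np

# Tabular Q-learning sketch (classic Gym API and a discrete environment assumed;
# alpha, gamma, epsilon are illustrative hyperparameters).
env = gym.make('FrozenLake-v1')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(2000):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy exploration
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        next_state, reward, done, _ = env.step(action)
        # move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state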

b. Deep Q-Networks (DQN):

When the state space is large or continuous, a neural network is used to approximate the Q-function. DQN is the combination of Q-learning and deep learning.
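A minimal sketch of that idea: the Q-table is replaced by a network that maps a state to one Q-value per action. The dimensions below match CartPole and are purely illustrative; the full training code appears in Section 4.

import torch
import torch.nn as nn

# The Q-table is replaced by a network: given a state, it outputs Q(s, a) for every action.
q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
state = torch.rand(4)               # e.g. a 4-dimensional CartPole state
q_values = q_net(state)             # one Q-value per action
action = q_values.argmax().item()   # greedy action with respect to the network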

3. Policy Gradient

Objective: Understand policy gradient methods and how they differ from value-based methods.

Content:

  • Policy gradient methods optimize the policy directly instead of a value function. They are particularly well suited to continuous action spaces, or to cases where the policy itself has a complex structure; see the REINFORCE-style sketch below.
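The following is a minimal REINFORCE-style sketch of a policy gradient update, assuming the classic Gym API; the network size, learning rate, and episode count are illustrative, not tuned values.

import torch
import torch.nn as nn
import torch.optim as optim
import gym

# Minimal REINFORCE sketch (classic Gym API assumed).
env = gym.make('CartPole-v1')
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 128),
    nn.ReLU(),
    nn.Linear(128, env.action_space.n),
    nn.Softmax(dim=-1),              # outputs action probabilities
)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(500):
    state = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))   # keep log pi(a|s) for the gradient
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    # compute the discounted return G_t for every time step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # policy gradient objective: maximize sum_t log pi(a_t|s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()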

4. Implementing Reinforcement Learning Algorithms with PyTorch

Objective: Learn to build and train reinforcement learning models with PyTorch.

Content:

Hands-on:

  • Implement DQN with PyTorch to solve the CartPole problem. The simplified version below shows the basic training loop; the project section later adds experience replay.
import torch
import torch.nn as nn
import torch.optim as optim
import random
import gym


# Define the DQN: a small MLP that maps a state to one Q-value per action
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, x):
        return self.fc(x)


# Define the training loop (the classic Gym API, gym < 0.26, is assumed)
def train_dqn(model, env, gamma=0.99, epsilon=0.1):
    optimizer = optim.Adam(model.parameters())
    criterion = nn.MSELoss()
    for episode in range(1000):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: explore with probability epsilon
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = model(torch.tensor(state, dtype=torch.float32)).argmax().item()
            next_state, reward, done, _ = env.step(action)
            # bootstrapped TD target: r + gamma * max_a' Q(s', a'), just r at terminal states
            with torch.no_grad():
                next_q = model(torch.tensor(next_state, dtype=torch.float32)).max()
            target = reward + gamma * next_q * (0.0 if done else 1.0)
            prediction = model(torch.tensor(state, dtype=torch.float32))[action]
            loss = criterion(prediction, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            state = next_state


env = gym.make('CartPole-v0')
model = DQN(env.observation_space.shape[0], env.action_space.n)
train_dqn(model, env)

Hands-on project: training a game agent with reinforcement learning

Project description: Use DQN to train an agent to play the CartPole game.

1. Environment setup

First, make sure the gym library is installed; it provides a collection of reinforcement learning environments. The code in this manual uses the classic Gym API (reset() returns only the observation and step() returns four values), which corresponds to gym versions earlier than 0.26.

pip install gym 
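If your installed Gym is 0.26 or newer, reset() and step() have different return signatures than the code below assumes. One option, depending on your setup, is to pin an older release:

pip install "gym<0.26"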

2. Import the required libraries

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import gym

3. The DQN network architecture

class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )
    
    def forward(self, x):
        return self.fc(x)

4. Experience replay

To stabilize DQN training, we use the experience replay technique: transitions are stored in a buffer and sampled at random, which breaks the correlation between consecutive training samples.

class ReplayBuffer:
    """Fixed-capacity ring buffer of transitions for experience replay."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        # Grow the list until capacity is reached, then overwrite the oldest entry
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Return a random mini-batch of stored transitions
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

5. The DQN training loop

def train_dqn(model, env, buffer, episodes, batch_size=64, gamma=0.99, epsilon=0.1):
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            # epsilon-greedy action selection: explore with probability epsilon
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = model(torch.tensor(state, dtype=torch.float32)).argmax().item()
            next_state, reward, done, _ = env.step(action)
            buffer.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

            if len(buffer) > batch_size:
                # sample a random mini-batch of transitions from the replay buffer
                batch = buffer.sample(batch_size)
                states, actions, rewards, next_states, dones = zip(*batch)
                states = torch.tensor(np.array(states), dtype=torch.float32)
                actions = torch.tensor(actions, dtype=torch.long)
                rewards = torch.tensor(rewards, dtype=torch.float32)
                next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
                dones = torch.tensor(dones, dtype=torch.float32)

                # Q(s, a) for the actions actually taken
                q_values = model(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)
                # TD target: r + gamma * max_a' Q(s', a'), zeroed at terminal states
                next_q_values = model(next_states).max(1)[0]
                target = rewards + gamma * next_q_values * (1 - dones)

                loss = criterion(q_values, target.detach())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        if episode % 10 == 0:
            print(f"Episode: {episode}, Total Reward: {total_reward}")


env = gym.make('CartPole-v1')
model = DQN(env.observation_space.shape[0], env.action_space.n)
buffer = ReplayBuffer(10000)
train_dqn(model, env, buffer, 500)

6. Evaluating the agent

After training, you can evaluate the agent's performance on the CartPole task.

def evaluate_agent(model, env, episodes=10):
    total_rewards = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            env.render()
            # act greedily with respect to the learned Q-values (no exploration)
            with torch.no_grad():
                action = model(torch.tensor(state, dtype=torch.float32)).argmax().item()
            next_state, reward, done, _ = env.step(action)
            state = next_state
            total_reward += reward

        total_rewards.append(total_reward)

    env.close()
    return np.mean(total_rewards)


avg_reward = evaluate_agent(model, env, episodes=10)
print(f"Average Reward over 10 episodes: {avg_reward}")

This code uses the trained model to play the CartPole game and prints its average score. You can tune parameters such as batch_size, gamma, and the number of training episodes to improve the agent's performance.
