Stop Memorizing MDP Formulas! Build a Super Mario AI by Hand with Python + PyTorch and Learn the Core of Reinforcement Learning in Practice

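This article walks through building a Super Mario AI with Python and PyTorch to explain the core concepts of reinforcement learning. It covers setting up the MDP framework, Q-learning, experience replay, and a deep Q-network (DQN) implementation, so that developers can use a game setting to build intuition for both the algorithms and the engineering practice.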

Lord Diplock


Lord Diplock · Published 2026-05-24 16:48:53

Hands-On Reinforcement Learning with Python + PyTorch: Building a Super Mario AI from Scratch

1. Why Games Are a Good Entry Point for Reinforcement Learning

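Classic games such as Super Mario are an excellent testbed for the core concepts of reinforcement learning. In this virtual environment the consequence of each decision shows up right away: when Mario jumps to dodge an enemy or picks up a coin, the action changes the game state and produces a reward. This fast feedback maps directly onto the "action-feedback" loop of reinforcement learning.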

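A game environment makes a good learning platform because:

Visual state space: the game frame itself is a natural representation of the state
Well-defined action space: basic moves such as walking and jumping form a discrete action space
Clear reward signal: game mechanics such as score and lives naturally define a reward function
Clear failure conditions: terminal events such as falling into a pit or touching an enemy define episode boundaries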

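import gym_super_mario_bros  # importing this package registers the SuperMarioBros environments (requires gym-super-mario-bros)
env = gym_super_mario_bros.make('SuperMarioBros-v0')
state = env.reset()  # get the initial game frame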
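Running the real game needs the emulator installed, so the action-feedback loop can also be tried on a toy stand-in first. The sketch below is illustrative only: `ToyCorridorEnv` and `run_random_episode` are hypothetical names that do not come from any library, and the environment merely mimics the `reset()`/`step()` interface of the older Gym API used throughout this article.

```python
import random


class ToyCorridorEnv:
    """Hypothetical 1-D stand-in for a game level: start at cell 0, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 = move left, action 1 = move right (walls clamp the position)
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.01  # small step cost, bonus at the goal
        return self.position, reward, done, {}


def run_random_episode(env, max_steps=200, seed=0):
    """Roll out one episode with a uniformly random policy; return (total_reward, steps)."""
    rng = random.Random(seed)
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.choice([0, 1])
        state, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps
```

Every reinforcement learning agent in the rest of the article follows the same pattern: observe a state, pick an action, receive a reward and the next state, and stop when `done` is true.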

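2. Building the Basic Reinforcement Learning Framework

2.1 Defining the Elements of the Markov Decision Process (MDP)

Super Mario can be formalized as the following MDP components: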

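MDP element	Game counterpart	Python representation
State (s)	Game frame	240x256x3 RGB array (raw NES frame)
Action (a)	Controller input	Discrete(6), one index per button combination
Reward (r)	Game score	Scalar value returned at every step
Transition probability	Game physics engine	env.step() function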

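class MarioMDP:
    def __init__(self, env):
        self.env = env  # Gym-style environment created earlier
        self.action_space = [
            'NOOP', 'RIGHT', 'LEFT',
            'UP', 'JUMP', 'DOWN'
        ]

    def step(self, action):
        # Interact with the game engine; `action` is an index the environment accepts
        next_state, reward, done, info = self.env.step(action)
        return next_state, reward, done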
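The MDP elements above describe a single step, but what the agent tries to maximize is the discounted sum of rewards over a whole episode. A minimal sketch of that quantity follows; `discounted_return` is a hypothetical helper name, and the default `gamma=0.9` is only an illustrative choice.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one episode, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A smaller `gamma` makes the agent short-sighted (later rewards count for less), while a value close to 1 makes distant rewards, such as reaching the flag, matter almost as much as immediate ones.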

2.2 Implementing the Core of Q-learning

Q-learning maintains a Q-table that estimates the value of each state-action pair:

import numpy as np

class QLearner:
    def __init__(self, state_size, action_size):
        self.q_table = np.zeros((state_size, action_size))
        self.alpha = 0.1  # learning rate
        self.gamma = 0.9  # discount factor
        
    def update(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        max_next_q = np.max(self.q_table[next_state])
        new_q = current_q + self.alpha * (
            reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q
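To see the update rule converge, it helps to run it on a problem small enough to inspect by hand. The following is a sketch under stated assumptions, not the article's Mario setup: the corridor task, the function name `train_chain_q_table`, and all hyperparameter defaults are made up for illustration. It uses the same update as `QLearner.update`, with one addition: it does not bootstrap from a terminal state.

```python
import random


def train_chain_q_table(n_states=5, episodes=300, alpha=0.1, gamma=0.9,
                        epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical corridor: action 1 moves right, action 0 left."""
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # epsilon-greedy action selection (ties go to "right")
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q_table[state][0] > q_table[state][1] else 1
            next_state = min(max(state + (1 if action == 1 else -1), 0), goal)
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            max_next_q = 0.0 if done else max(q_table[next_state])
            td_target = reward + gamma * max_next_q
            q_table[state][action] += alpha * (td_target - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table
```

After training, the value of moving right should exceed the value of moving left in every non-terminal cell, and the values should shrink by roughly a factor of `gamma` per cell of distance from the goal.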

Note: in practice the game's state space is far too large to fit in a table, so we will later replace the Q-table with a neural network.
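A quick back-of-the-envelope calculation shows how hopeless a table is for raw pixels. The helper below is a hypothetical sketch; the 240x256 frame size in the usage note is an assumption about the raw emulator output, and any other frame size can be plugged in.

```python
import math


def decimal_digits_in_state_count(height, width, channels=3, levels=256):
    """Number of decimal digits in levels ** (height * width * channels)."""
    n_values = height * width * channels
    return math.floor(n_values * math.log10(levels)) + 1
```

For a 240x256 RGB frame, `decimal_digits_in_state_count(240, 256)` is 443887, meaning the count of distinct possible frames is a number with well over four hundred thousand digits. Most of those frames never occur in the game, but even the reachable subset is far beyond what a lookup table can hold.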

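3. From Tabular Methods to Deep Reinforcement Learning

3.1 Building a Deep Q-Network (DQN) with PyTorch

Tabular Q-learning runs into the "curse of dimensionality", so we approximate the Q-function with a neural network: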

import numpy as np  # used by _get_conv_out
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, input_shape, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        
        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )
        
    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))
    
    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)
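`_get_conv_out` finds the flattened feature size by pushing a dummy tensor through the convolutions. The same number can be checked by hand with the standard output-size formula for an unpadded convolution, floor((size - kernel) / stride) + 1. The sketch below assumes the three layers defined above; the function names are hypothetical, and the 84x84 input in the usage note is the common DQN preprocessing choice rather than something the article specifies.

```python
def conv_out_side(size, kernel_size, stride, padding=0):
    """Output side length of a Conv2d layer with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


def dqn_conv_out_size(height, width):
    """Flattened feature count after the three conv layers of the DQN above."""
    for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
        height = conv_out_side(height, kernel_size, stride)
        width = conv_out_side(width, kernel_size, stride)
    return 64 * height * width  # 64 = output channels of the last conv layer
```

For an 84x84 input the side length shrinks 84 -> 20 -> 9 -> 7, so `dqn_conv_out_size(84, 84)` gives 64 * 7 * 7 = 3136 inputs to the first linear layer.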

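3.2 Implementing Experience Replay

DQN uses an experience replay buffer to reuse past transitions and to break the correlation between consecutive samples: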

from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
        
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
        
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)
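Two details of the buffer are easy to miss: a `deque` with `maxlen` silently evicts the oldest transitions once it is full, and a sampled batch is a list of tuples that usually has to be transposed into per-field sequences. The short sketch below demonstrates both with placeholder integers instead of real game frames; the tiny capacity of 3 is chosen only to make the eviction visible.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity so eviction is easy to see
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((step, step % 2, float(step), step + 1, step == 4))

oldest_state = buffer[0][0]  # transitions 0 and 1 have been evicted

random.seed(0)
transitions = random.sample(list(buffer), 2)
# zip(*...) turns a list of transitions into one tuple per field
states, actions, rewards, next_states, dones = zip(*transitions)
```

The `zip(*transitions)` idiom is the same transposition the training loop relies on before it converts each field to a tensor.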

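4. Training Strategy and Performance Optimization

4.1 The Complete Training Loop

Combining the DQN with experience replay gives the complete training procedure: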

def train(env, model, optimizer, buffer, batch_size=32, gamma=0.99, epsilon=0.1):
    state = env.reset()
    episode_reward = 0
    
    while True:
        # epsilon-greedy action selection (epsilon is the exploration rate)
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            state_t = torch.FloatTensor(state).unsqueeze(0)
            q_values = model(state_t)
            action = torch.argmax(q_values).item()
        
        # Execute the action and store the transition
        next_state, reward, done, _ = env.step(action)
        buffer.push(state, action, reward, next_state, done)
        episode_reward += reward
        
        # Sample a batch from the buffer and train on it
        if len(buffer) >= batch_size:
            transitions = buffer.sample(batch_size)
            batch = list(zip(*transitions))
            
            states = torch.FloatTensor(batch[0])
            actions = torch.LongTensor(batch[1])
            rewards = torch.FloatTensor(batch[2])
            next_states = torch.FloatTensor(batch[3])
            dones = torch.FloatTensor(batch[4])
            
            current_q = model(states).gather(1, actions.unsqueeze(1))
            next_q = model(next_states).max(1)[0].detach()
            target_q = rewards + gamma * next_q * (1 - dones)
            
            loss = nn.MSELoss()(current_q.squeeze(), target_q)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        state = next_state
        if done:
            state = env.reset()
            print(f"Episode reward: {episode_reward}")
            episode_reward = 0
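The ε-greedy branch in the loop above depends on an exploration rate `epsilon`. A common choice is to start with mostly random actions and anneal towards mostly greedy ones as training progresses. The sketch below shows one such schedule; `epsilon_by_step` is a hypothetical helper and its default values are illustrative assumptions, not settings taken from this article.

```python
def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps, then hold it."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

Inside the loop this would be evaluated once per environment step from a running step counter, so exploration is high while the replay buffer is still filling up and low once the Q-values have become informative.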

4.2 Advanced Optimization Techniques

To improve training stability, we can introduce the following improvements:

Target Network:

target_net = DQN(input_shape, n_actions)
target_net.load_state_dict(model.state_dict())

# Periodically copy the online network's weights into the target network
if step % target_update == 0:
    target_net.load_state_dict(model.state_dict())

Double DQN:

next_actions = model(next_states).max(1)[1]  # the online network picks the action
next_q = target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1).detach()  # the target network evaluates it
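Written out, the target-network and Double DQN snippets correspond to the following training targets. The notation is an assumption introduced here for clarity: θ denotes the parameters of `model`, θ⁻ those of `target_net`, and d the `done` flag.

```latex
\begin{aligned}
y^{\text{DQN}} &= r + \gamma\,(1 - d)\,\max_{a'} Q_{\theta^-}(s', a') \\
y^{\text{Double}} &= r + \gamma\,(1 - d)\,Q_{\theta^-}\bigl(s', \arg\max_{a'} Q_{\theta}(s', a')\bigr)
\end{aligned}
```

The only difference is who chooses the next action. Taking the max over the same network that supplies the value tends to overestimate Q-values, and splitting selection (online network) from evaluation (target network) reduces that bias.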

Prioritized Experience Replay:

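# Compute sample priorities from the TD error
# (assumes a prioritized buffer that returned `indices` when sampling and exposes update_priorities)
td_error = (target_q - current_q.squeeze(1)).abs().detach()
buffer.update_priorities(indices, td_error.numpy())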
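The snippet above presupposes a buffer that hands out indices when sampling and accepts updated priorities, which the simple `ReplayBuffer` from section 3.2 does not do. Below is a minimal sketch of such a buffer using proportional prioritization. Everything in it is an assumption for illustration: the class name, the `alpha` and `eps` parameters, and the list-based O(n) sampling (practical implementations typically use a sum tree and also apply importance-sampling weights, both omitted here).

```python
import random


class PrioritizedReplayBuffer:
    """Hypothetical proportional prioritized buffer (list-based, O(n) sampling)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha   # 0 = uniform sampling, 1 = fully proportional to priority
        self.eps = eps       # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.next_index = 0  # slot to overwrite once the buffer is full

    def push(self, transition):
        # New transitions get the current maximum priority so they are likely to be sampled soon.
        priority = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.next_index] = transition
            self.priorities[self.next_index] = priority
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size):
        # Draw indices with probability proportional to priority ** alpha (with replacement).
        weights = [p ** self.alpha for p in self.priorities]
        indices = random.choices(range(len(self.data)), weights=weights, k=batch_size)
        return indices, [self.data[i] for i in indices]

    def update_priorities(self, indices, td_errors):
        for index, error in zip(indices, td_errors):
            self.priorities[index] = abs(float(error)) + self.eps

    def __len__(self):
        return len(self.data)
```

With this interface, a training step would call `indices, transitions = buffer.sample(batch_size)`, compute the TD errors for that batch, and pass both back through `update_priorities`, so transitions the network predicts poorly are revisited more often.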

5. Visualization and Result Analysis

5.1 Monitoring the Training Process

Use TensorBoard to record key metrics:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_scalar('Loss/train', loss.item(), step)
writer.add_scalar('Reward/episode', episode_reward, episode)
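Per-episode reward is usually very noisy, so a smoothed curve is often easier to read than the raw one. The helper below is a small sketch of a running mean over the most recent episodes; `RunningMean` and its default window of 100 are hypothetical choices, not part of TensorBoard or PyTorch.

```python
from collections import deque


class RunningMean:
    """Mean of the most recent `window` values, e.g. episode rewards."""

    def __init__(self, window=100):
        self.values = deque(maxlen=window)

    def add(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)
```

The value returned by `add(episode_reward)` can be logged with another `writer.add_scalar` call under a tag of your choosing, next to the raw episode reward.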

5.2 Analyzing the Game AI's Behavior

After training, we can watch the policy the AI has learned:

def evaluate(model, env, episodes=10):
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            env.render()
            state_t = torch.FloatTensor(state).unsqueeze(0)
            action = torch.argmax(model(state_t)).item()
            state, _, done, _ = env.step(action)

A typical progression of learned behavior: