Diffusion Model: Theoretical Derivation

Diffusion Models: Generative Diffusion Models

Q&Cui · Published 2022-10-12 20:45:10

An Intuitive Understanding of the Diffusion Model

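A generative model is essentially a family of probability distributions. As the figure below shows, on the left is a training dataset whose samples are drawn independently and identically from some data distribution $p_{data}$. On the right is the generative model (a family of probability distributions): within this family we look for a distribution $p_θ$ that is as close as possible to $p_{data}$. Sampling from $p_θ$ then yields an endless supply of new data.
[Figure: training dataset (left) and the generative model's family of distributions (right)]
However, $p_{data}$ usually has a very complex form, and images are high-dimensional, so it is hard to traverse the whole space; the number of data samples we can observe is also limited.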

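What does diffusion do?

We can take any distribution, including of course the $p_{data}$ we care about, and keep adding noise until it eventually becomes a pure-noise distribution $N (0, I)$. How should we understand this?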

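From the perspective of probability distributions, consider the Swiss-roll-shaped two-dimensional joint distribution $p (x, y)$ in the figure below. The diffusion process $q$ is very intuitive: sample points that were originally concentrated and ordered are perturbed by noise, spread outward, and eventually become a completely disordered noise distribution.
[Figure: evolution of the probability distribution]
The diffusion model is actually the reverse of the process in the figure: it gradually denoises a noise distribution $N (0, I)$ so that it maps onto $p_{data}$. With such a mapping, we can sample from the noise distribution and end up with the image we want, which means we can generate.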

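Looking at the same process for a single image sample, the diffusion process $q$ keeps adding noise to the image until it becomes pure noise, and the reverse diffusion process generates an image starting from pure noise.
[Figure: evolution of a single image sample]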

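A Formal Analysis of the Diffusion Model

Since Diffusion Models are generative models, they are used to generate data similar to the training data. Fundamentally, Diffusion Models work by destroying the training data through successively adding Gaussian noise, and then learning to recover the data by reversing this noising process.

After training, we can pass randomly sampled noise into the model and generate data through the learned denoising process. This is the basic principle illustrated in the figure below, although the figure is still somewhat coarse.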

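More specifically, a diffusion model is a latent variable model that uses a Markov chain (MC) to map to the latent space. Along the Markov chain, noise is gradually added to the data at each time step $t$, starting from $x_0$, to obtain the posterior $q(x_{1:T} | x_0)$, where $x_1,...,x_T$ are the noised versions of the input and also serve as the latent variables. In other words, the latent space of Diffusion Models has the same dimensionality as the input data.

Note: Posterior probability: in Bayesian statistics, the posterior probability of a random or uncertain event is the conditional probability obtained after the relevant evidence or data has been taken into account.

A Markov chain is a stochastic process that moves from one state to another within a state space. The process must be "memoryless": the probability distribution of the next state is determined by the current state alone, and events earlier in the sequence have no bearing on it. This particular kind of memorylessness is called the Markov property.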

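Diffusion Models consist of a forward diffusion process and a reverse diffusion process. The figure below shows the diffusion process: the sequence from $x_0$ to the final $x_T$ is a Markov chain, a stochastic process that moves from one state to another within the state space. The images underneath show the corresponding image diffusion process.
[Figure: evolution of a single image sample]
Eventually, the real image fed in as $x_0$ is asymptotically transformed by the forward process into a pure Gaussian noise image $x_T$.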

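Model training focuses mainly on the reverse diffusion process. The goal of training a diffusion model is to learn the reverse of the forward process, i.e. to train the distribution $p_θ(x_{t-1}|x_t)$. By traversing the Markov chain backwards, new data $x_0$ can be generated.

This is where it gets interesting: the biggest difference between Diffusion Models and GANs or VAEs is that generation is not done by a single pass through one model. Instead, it is based on a Markov chain and generates data by learning to predict noise.

Besides generating fun, high-quality images, Diffusion Models have many other benefits. The most important is that training involves no adversarial game. For GANs, adversarial training is hard to debug, because the two models competing with each other during training are a black box to us. In terms of training efficiency, diffusion models are also scalable and parallelizable. How to speed up training, how to add more mathematical rules and constraints, and how to extend the approach to speech, text, and 3D are all interesting questions that can lead to many new papers.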

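The Diffusion Model in Detail

As shown above, Diffusion Models consist of a forward process (the diffusion process) and a reverse process (the reverse diffusion process): the input data is gradually turned into noise, and the noise is then converted back into a sample from the source target distribution.

Next comes a little mathematics. I will try to keep it simple: it is just a Markov chain plus conditional probability distributions. The core question is how to use a neural network to solve for the probability distributions of the Markov process.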

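Diffusion forward process (diffusion process)

The forward process is the process of adding noise to an image. Although this step cannot generate images, it is essential for understanding the diffusion model and for constructing the ground truth (GT) of the training samples.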

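Given a real image sample $x_0 \sim q(x)$, the forward diffusion process adds Gaussian noise to it cumulatively over $T$ steps, producing $x_1,x_2,...,x_T$, the $q$ process in the figure below. The size of each step is controlled by a sequence of Gaussian variance hyperparameters $\{\beta_t \in (0,1)\}_{t=1}^{T}$. Because each time step $t$ in the forward process depends only on time step $t - 1$, the process can also be viewed as a Markov process:
$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$
During this process, as $t$ increases, $x_t$ gets closer and closer to pure noise. As $T\to \infty$, $x_T$ becomes pure Gaussian noise (this is shown below, and it depends on the choice of the mean coefficient $\sqrt{1-\beta_t}$).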

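Before finishing with the forward process, we need to describe two important properties that diffusion relies on in both implementation and derivation.

Property 1: the reparameterization trick

The reparameterization trick is used in many works (Gumbel-Softmax, VAE). If we sample directly from some distribution (say a Gaussian), gradients cannot be backpropagated through the sampling step. Yet sampling $x_t$ via Gaussian noise happens everywhere in diffusion, so we need the reparameterization trick to make it differentiable.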

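The most common approach is to route the randomness through an independent random variable ($\epsilon$). That is, to sample $z$ from a Gaussian distribution $z \sim N(z;\mu_{\theta},\sigma^{2}_{\theta}I)$, we can write:
$z = \mu_{\theta} + \sigma_{\theta} \odot \epsilon, \quad \epsilon \sim N(0, I)$
The $z$ above is still random and follows a Gaussian with mean $\mu_{\theta}$ and variance $\sigma^{2}_{\theta}$. Here $\mu_{\theta}$ and $\sigma^{2}_{\theta}$ can be inferred by a neural network with parameters $\theta$. The whole "sampling" process remains differentiable, and the randomness has been shifted onto $\epsilon$.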
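As a minimal sketch of the trick, assuming scalar values and the Python standard library only (the function name `sample_reparameterized` is made up for illustration), the sample is written as a deterministic function of the mean, the standard deviation, and an external noise draw:

```python
import random

def sample_reparameterized(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise."""
    eps = rng.gauss(0.0, 1.0)  # all randomness lives in eps ~ N(0, 1)
    z = mu + sigma * eps       # dz/dmu = 1 and dz/dsigma = eps, so gradients can flow
    return z, eps

rng = random.Random(0)
mu, sigma = 2.0, 0.5
samples = [sample_reparameterized(mu, sigma, rng)[0] for _ in range(20000)]

# Empirical check that the samples still follow N(mu, sigma^2).
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"empirical mean={mean:.3f}, empirical variance={var:.3f}")
```

In a deep learning framework, `mu` and `sigma` would be network outputs, and automatic differentiation would propagate gradients through `mu + sigma * eps` while treating `eps` as a constant.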

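Property 2: $x_t$ at any time step can be expressed in terms of $x_0$ and $\beta_t$

The forward process has a very nice property: we can obtain $x_t$ directly from $x_0$ and $\beta$. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$; then: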
$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{ ;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} & \text{ ;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*).} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \\ q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}) \end{aligned}$
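(*) By the additivity of independent Gaussians, the sum of samples from $\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$ is distributed as $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$. So in the second line of the derivation we merge two Gaussians into a single Gaussian with standard deviation $\sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} = \sqrt{1 - \alpha_t\alpha_{t-1}}$. Therefore $x_t$ at any time step satisfies:
$q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$.
In practice $\beta_t$ increases with $t$, i.e. $\beta_1 < \beta_2 < \dots < \beta_T$. In the GLIDE code, $\beta_t$ is linearly interpolated from 0.0001 to 0.02. Therefore $\bar{\alpha}_1 > \dots > \bar{\alpha}_T$, and $\bar{\alpha}_T$ approaches 0 as $T$ grows, so $q(\mathbf{x}_T \vert \mathbf{x}_0)$ approaches $\mathcal{N}(\mathbf{0}, \mathbf{I})$.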
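The closed-form sampling of $x_t$ can be sketched as follows. This is an illustrative toy, not code from any particular repository: it assumes a linear schedule from 0.0001 to 0.02 with $T = 1000$, represents an "image" as a short list of floats, and the helper names (`linear_beta_schedule`, `cumulative_alphas`, `q_sample`) are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """beta_1..beta_T, linearly spaced (list index 0 holds beta_1)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    return xt, eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                      # toy "image" with four pixels
x_mid, _ = q_sample(x0, 500, alpha_bars, rng)   # partially noised
x_end, _ = q_sample(x0, T, alpha_bars, rng)     # almost pure noise

# alpha_bar starts close to 1 and decays towards 0, so x_T keeps almost none of x_0.
print(alpha_bars[0], alpha_bars[-1])
```

Because `q_sample` needs only $x_0$, $t$, and one noise draw, training can pick time steps at random without simulating the whole chain.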

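Diffusion reverse process

If the forward process is the process of adding noise, then the reverse process is diffusion's denoising inference process.
If we could reverse the process above and sample from $q(x_{t-1}|x_t)$, we could recover the original image distribution from Gaussian noise $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Reference 7 proves that if $q(x_t|x_{t-1})$ is Gaussian and $\beta_t$ is small enough, $q(x_{t-1}|x_t)$ is still Gaussian. However, we cannot easily infer $q(x_{t-1}|x_t)$, so we use a deep learning model (with parameters $\theta$; currently the mainstream architecture is U-Net + attention) to predict such a reverse distribution $p_\theta$ (similar to a VAE):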
$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$
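In the paper, however, the authors take the variance of the conditional distribution $p_{\theta}(x_{t-1}|x_t)$ directly to be $\beta_t$ rather than the network-estimated $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ mentioned above, so in practice only the mean needs to be estimated by the network (later papers also let the network predict the variance).
At this point we can see that forward and reverse diffusion share the same skeleton: both are Markov chains, with Gaussian distributions and step-by-step conditional probabilities. The only difference is that in forward diffusion the mean and variance of each conditional Gaussian are already fixed (they depend on $\beta_t$ and $x_0$), whereas in reverse diffusion the mean and variance have to be learned by our network.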

Deriving the reverse diffusion conditional probability

Although we cannot obtain the distribution of the reversed process $q(x_{t-1}|x_t)$, once $x_0$ is known, $q(x_{t-1}|x_t,x_0)$ can be written down directly. It has roughly the following form:
$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \color{blue}{\tilde{\boldsymbol{\mu}}}(\mathbf{x}_t, \mathbf{x}_0), \color{red}{\tilde{\beta}_t} \mathbf{I})$.

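By Bayes' rule: $P(A|B)=\frac{P(A)P(B|A)}{P(B)}$
Basic theorems of conditional probability:
Multiplication rule: if $P (A) > 0$, then:
$P (A B) = P (B ∣ A) P (A)$
$P (A BC) = P (A) P (B ∣ A) P (C ∣ A B)$
Substituting into Bayes' rule: $P(A|B)=\frac{P(AB)}{P(B)}$
The derivation via Bayes' rule is as follows: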
$\begin{aligned} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\ &\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp \Big(-\frac{1}{2} \big(\frac{\mathbf{x}_t^2 - 2\sqrt{\alpha_t} \mathbf{x}_t \color{blue}{\mathbf{x}_{t-1}} \color{black}{+ \alpha_t} \color{red}{\mathbf{x}_{t-1}^2} }{\beta_t} + \frac{ \color{red}{\mathbf{x}_{t-1}^2} \color{black}{- 2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0} \color{blue}{\mathbf{x}_{t-1}} \color{black}{+ \bar{\alpha}_{t-1} \mathbf{x}_0^2} }{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp\Big( -\frac{1}{2} \big( \color{red}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}})} \mathbf{x}_{t-1}^2 - \color{blue}{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)} \mathbf{x}_{t-1} \color{black}{ + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big)} \end{aligned}$
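This neatly turns the reverse process entirely into forward-process terms, namely: $(x_{t-1},x_0)\rightarrow x_t; x_0 \rightarrow x_t; x_0 \rightarrow x_{t-1}$.
Note that because the forward process has the Markov property, $q(x_t|x_{t-1},x_0)$ is in fact equivalent to $q(x_t|x_{t-1})$, which is why the first term of the exponent involves only $x_t$ and $x_{t-1}$.
The exponent of a general Gaussian probability density function can be written as:
$\exp(-\frac{(x-\mu)^2}{2 \sigma^2}) = \exp( -\frac{1}{2} (\frac{1}{\sigma^2} x^2- \frac{2 \mu}{\sigma^2}x+\frac{\mu^2}{\sigma^2}))$.
Matching coefficients and rearranging a little, we obtain the variance and mean in the expression above: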
$\begin{aligned} \tilde{\beta}_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = 1/(\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t(1 - \bar{\alpha}_{t-1})}) = \color{green}{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\ \tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \\ &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1} }}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \color{green}{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t} \\ &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0\\ \end{aligned}$
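The variance $\tilde{\beta}_t$ needs no further work and can be used as is.
As for the mean, from the forward process we know that $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t)$; substituting this into the expression above gives: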
$\begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t) \\ &= \color{cyan}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)} \end{aligned}$
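We can see that, given $x_0$, the mean of the posterior conditional Gaussian depends only on the hyperparameters, $x_t$, and $\epsilon_t$, while the variance depends only on the hyperparameters. We have thus obtained the analytical form of $q(x_{t-1}|x_t,x_0)$.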
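A quick numerical sanity check of the two expressions for the posterior mean can be sketched like this. The setup is an assumption made for illustration: scalar $x$, a linear schedule from 0.0001 to 0.02 with $T = 1000$, and hypothetical helper names.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

def posterior_from_x0(xt, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar x; t is 1-indexed, t >= 2."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    a_bar_t, a_bar_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    mean = (
        math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t) * xt
        + math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t) * x0
    )
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, var

def posterior_mean_from_eps(xt, eps, t):
    """The same mean, rewritten in terms of the noise eps_t instead of x_0."""
    alpha_t, a_bar_t = alphas[t - 1], alpha_bars[t - 1]
    return (xt - (1.0 - alpha_t) / math.sqrt(1.0 - a_bar_t) * eps) / math.sqrt(alpha_t)

rng = random.Random(0)
t, x0 = 300, 0.7
eps = rng.gauss(0.0, 1.0)
xt = math.sqrt(alpha_bars[t - 1]) * x0 + math.sqrt(1.0 - alpha_bars[t - 1]) * eps

mean_a, var = posterior_from_x0(xt, x0, t)
mean_b = posterior_mean_from_eps(xt, eps, t)
print(abs(mean_a - mean_b))  # difference is at floating-point round-off level
```

The posterior variance $\tilde{\beta}_t$ is always slightly smaller than $\beta_t$, since $1 - \bar{\alpha}_{t-1} < 1 - \bar{\alpha}_t$.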

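Training loss

Having worked through the reverse diffusion process, we now understand the denoising inference process. The remaining question is how to train Diffusion Models to obtain the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and variance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ of $p_\theta(x_{t-1}|x_t)$.
In VAEs we saw the role of maximum likelihood estimation: when the real training samples are known and the model parameters are sought, maximum likelihood estimation can be used.
In statistics, the likelihood function is a function of the parameters of a statistical model. Given an observed outcome $x$, the likelihood $L (θ ∣ x)$ of the parameter $θ$ is (numerically) equal to the probability of the variable $X$ given the parameter $θ$: $L (θ ∣ x) = P (X = x ∣ θ)$.
Diffusion Models use maximum likelihood estimation to find the transition distributions of the Markov chain in the reverse diffusion process; this is the training objective of Diffusion Models. That is, we maximize the log-likelihood of the model's predicted distribution, which from the loss-minimization viewpoint means minimizing the negative log-likelihood:
$\begin{aligned} L = -\mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \end{aligned}$
This is very similar to a VAE: the variational lower bound (VLB) can be used to optimize the negative log-likelihood.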
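As a reminder, the KL divergence is an asymmetric statistical distance that measures how much one probability distribution $P$ differs from another probability distribution $Q$.
For continuous distributions the KL divergence is:
$\begin{aligned} D_{KL}(P||Q)= \int_{-\infty}^{\infty}p(x) \log(\frac{p(x)}{q(x)})\, {\rm d} x \end{aligned}$.
Properties of the KL divergence:
1. Asymmetry: in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$.
2. $D_{KL}(P||Q) \geq0$, with equality only when $P = Q$.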
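Both properties are easy to see numerically. The sketch below uses the discrete analogue of the integral above (a sum over outcomes) on two made-up three-outcome distributions; the function name `kl_divergence` and the values of `P` and `Q` are chosen purely for illustration.

```python
import math

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.4, 0.1]
Q = [0.3, 0.3, 0.4]

kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
kl_pp = kl_divergence(P, P)

# kl_pq and kl_qp are both positive but differ; kl_pp is zero.
print(kl_pq, kl_qp, kl_pp)
```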
Since the KL divergence is non-negative, we obtain:
$\begin{aligned}- \log p_\theta(\mathbf{x}_0) &\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\vert\mathbf{x}_0) ) \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\ &= \mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ \text{Let }L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \geq - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \end{aligned}$
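Expanding $L_{VLB}$ further yields a sum of an entropy term and several KL divergences; see reference [8] for details. Here I simply reproduce the derivation from Lil's blog:
The same result is also easy to obtain with Jensen's inequality. Suppose we want to minimize the cross entropy as the learning objective: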
$\begin{aligned} L_\text{CE} &= - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \\ &= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \Big) \\ &= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} d\mathbf{x}_{1:T} \Big) \\ &= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \mathbb{E}_{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \Big) \\ &\leq - \mathbb{E}_{q(\mathbf{x}_{0:T})} \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \\ &= \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log \frac{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})}{p_\theta(\mathbf{x}_{0:T})} \Big] = L_\text{VLB} \end{aligned}$
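To make every term in the equation analytically computable, the objective can be further rewritten as a combination of several KL divergence and entropy terms (see Appendix B of Sohl-Dickstein et al. for the detailed step-by-step process):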
$\begin{aligned} L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ &= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big]\\ &= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log 
\frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\ &= \mathbb{E}_q [\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} ] \end{aligned}$
Let us simplify this pile of terms:
$\begin{aligned} L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\ \text{where } L_T &= D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)) \\ L_t &= D_\text{KL}(q(\mathbf{x}_t \vert \mathbf{x}_{t+1}, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_t \vert\mathbf{x}_{t+1})) \text{ for }1 \leq t \leq T-1 \\ L_0 &= - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \end{aligned}$
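Next we discuss the three cases $L_T$, $L_t$, and $L_0$ in turn:
First, since the forward process $q$ has no learnable parameters and $x_T$ is pure Gaussian noise, $L_T$ can be treated as a constant and ignored.
Then, $L_t$ is a KL divergence, so it can be seen as pulling two distributions closer together (written below with the index shifted by one relative to the definition above):
1. The first distribution, $q(x_{t-1}|x_t,x_0)$, whose analytical form we derived in the previous section, is a Gaussian with the following mean and variance: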
$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big), \tilde{\beta}_t = {\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t}$
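2. The second distribution, $p_{\theta}(x_{t-1}|x_t)$, is the distribution that our network is meant to fit. It is also a Gaussian: the mean is estimated by the network and the variance is set to a constant related to $\beta_t$.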
$\begin{aligned} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \end{aligned}$
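We know that if two distributions $p$ and $q$ are both univariate Gaussians, $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$, their KL divergence is:
$\begin{aligned} KL(p,q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2} \end{aligned}$.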
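The closed form can be checked against a direct numerical integration of the KL definition. This is a small sketch under stated assumptions: univariate Gaussians, a simple midpoint rule over a finite interval, and hypothetical function names.

```python
import math

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
        - 0.5
    )

def numerical_kl(mu1, sigma1, mu2, sigma2, lo=-15.0, hi=15.0, n=30000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        z1 = (x - mu1) / sigma1
        z2 = (x - mu2) / sigma2
        p = math.exp(-0.5 * z1 * z1) / (sigma1 * math.sqrt(2.0 * math.pi))
        # log p(x) - log q(x), written analytically to avoid dividing tiny numbers
        log_ratio = math.log(sigma2 / sigma1) - 0.5 * z1 * z1 + 0.5 * z2 * z2
        total += p * log_ratio * dx
    return total

closed = gaussian_kl(0.0, 1.0, 1.0, 1.5)
numeric = numerical_kl(0.0, 1.0, 1.0, 1.5)
print(closed, numeric)

# With equal variances only the squared mean difference survives.
same_var = gaussian_kl(0.0, 2.0, 3.0, 2.0)  # equals (0 - 3)^2 / (2 * 2^2)
```

The equal-variance case is the one that matters for the loss: when both variances are fixed, the KL term is proportional to the squared distance between the means.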
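Because the variances of both distributions are constants and play no role in the optimization, the objective reduces to the squared L2 distance between the two means: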
$\begin{aligned} L_t &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \| \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) \|^2_2} \| \color{blue}{\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)} - \color{green}{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \|^2 \Big] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \|\boldsymbol{\Sigma}_\theta \|^2_2} \| \color{blue}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)} - \color{green}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t) \Big)} \|^2 \Big] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big] \end{aligned}$
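At this point we could also have the network predict $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ directly. However, we can see that $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ should match $\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)$ as closely as possible.
Because $x_t$ is an input to $\mu_{\theta}$ and all other quantities are constants, the only unknown is in fact $\epsilon_t$, so we simply define $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ as:
$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big)$
In other words, rather than predicting $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ directly, the network first predicts the noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, and the predicted noise is then substituted into the expression above to compute the predicted mean; the two are equivalent.
After this derivation, and after dropping the weighting coefficient in front of the norm, what remains is simply an L2 loss. The network takes as input an image linearly combined with noise and has to estimate that noise: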
$\begin{aligned} L_t^\text{simple} &= \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\ &= \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big] \end{aligned}$
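A one-sample estimate of this loss can be sketched as below. Everything here is an illustrative assumption rather than the authors' code: a toy four-pixel "image", a linear schedule with $T = 1000$, and two stand-in predictors (`zero_model`, which always predicts zero noise, and `oracle_model`, which cheats by knowing $x_0$) in place of a trained network $\boldsymbol{\epsilon}_\theta$.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def simple_loss(x0, t, eps_model, rng):
    """One-sample estimate of ||eps - eps_theta(x_t, t)||^2 for a 1-indexed step t."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    pred = eps_model(xt, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

x0 = [1.0, -1.0, 0.5, 0.0]  # toy "image"

def zero_model(xt, t):
    """Hypothetical untrained predictor: always guesses zero noise."""
    return [0.0 for _ in xt]

def oracle_model(xt, t):
    """Hypothetical perfect predictor: inverts the forward formula using the known x0."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * x0_i) / math.sqrt(1.0 - a_bar)
            for x, x0_i in zip(xt, x0)]

rng = random.Random(0)
t = rng.randint(1, T)  # t ~ Uniform{1, ..., T}
loss_zero = simple_loss(x0, t, zero_model, rng)
loss_oracle = simple_loss(x0, t, oracle_model, rng)
print(loss_zero, loss_oracle)
```

A real training step would replace the stand-in predictors with a neural network and backpropagate through this loss; the oracle only shows that a predictor recovering the added noise drives the loss to zero.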

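Training process

Training follows the Algorithm 1 (Training) part on the left of the figure:
1. Sample a training image $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, a time step $t$ uniformly from $\{1, \dots, T\}$, and a noise $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ from the standard Gaussian;
2. Minimize the loss $\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2$ by gradient descent;
3. Repeat until convergence (training takes a fairly long time; T is set to 1000 in the code).
Testing (sampling) follows the Algorithm 2 (Sampling) part on the right of the figure:
1. Sample a noise $\boldsymbol{x_T} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ from the standard Gaussian;
2. Iterate the reverse diffusion from time step T down to time step 1;
3. If the time step is not 1, sample a noise $\boldsymbol{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ from the standard Gaussian; otherwise $z = 0$;
4. Compute the less noisy image at each time step t from the Gaussian: $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big) + \sigma_t \mathbf{z}$, with $\sigma_t^2 = \beta_t$.
[Figure: Algorithm 1 (Training) and Algorithm 2 (Sampling)]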
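The sampling loop can be sketched in a few lines. This is a toy under explicit assumptions, not the reference implementation: the data distribution is a single point (every pixel equals a hypothetical constant `TARGET`), so the ideal noise predictor is known in closed form and `oracle_eps` stands in for a trained U-Net; the variance is fixed to $\sigma_t^2 = \beta_t$; and the predictor receives $\bar{\alpha}_t$ as an extra argument purely for convenience.

```python
import math
import random

def ddpm_sample(eps_model, n, betas, rng):
    """Ancestral sampling: start at x_T ~ N(0, I) and step from t = T down to t = 1."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    x = [rng.gauss(0.0, 1.0) for _ in range(n)]            # x_T
    for t in range(len(betas), 0, -1):
        beta_t = betas[t - 1]
        alpha_t = 1.0 - beta_t
        a_bar_t = alpha_bars[t - 1]
        coef = (1.0 - alpha_t) / math.sqrt(1.0 - a_bar_t)
        sigma_t = math.sqrt(beta_t)                        # fixed variance choice
        eps_hat = eps_model(x, t, a_bar_t)
        x = [
            (xi - coef * e) / math.sqrt(alpha_t)
            + (sigma_t * rng.gauss(0.0, 1.0) if t > 1 else 0.0)  # z = 0 at the last step
            for xi, e in zip(x, eps_hat)
        ]
    return x

TARGET = 0.5  # hypothetical toy dataset: every pixel of every image equals 0.5

def oracle_eps(x, t, a_bar_t):
    """Ideal noise prediction when x_0 is always TARGET (stand-in for a trained network)."""
    return [(xi - math.sqrt(a_bar_t) * TARGET) / math.sqrt(1.0 - a_bar_t) for xi in x]

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
sample = ddpm_sample(oracle_eps, 4, betas, random.Random(0))
print(sample)  # every entry ends up at TARGET, up to floating-point error
```

With a perfect predictor for this degenerate dataset, the chain lands exactly on the data point; with a learned network the same loop produces approximate samples from the data distribution.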

Quick Recap

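Forward / diffusion process

The forward process, also called the diffusion process, takes the form of a fixed Markov chain that gradually adds Gaussian noise to the image: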
$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$
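In DDPM, $\beta_t$ is a preset fixed parameter.
The diffusion process has an important property: we can directly sample the noised result at any time step $t$. Letting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, we obtain: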
$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{ ;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} & \text{ ;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*).} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \\ q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}) \end{aligned}$
This closed-form formula lets us obtain an image with any level of noise directly, which is convenient for the subsequent training.

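Reverse process

The reverse process starts from a random Gaussian noise image $x_T$ and generates the final result $x_0$ through step-by-step denoising. This process is a Markov chain and can be defined as: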
$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$
This process can be understood as follows: taking $x_t$ as input, we predict the mean and variance of a Gaussian distribution, then sample $x_{t-1}$ at random from the predicted distribution. By repeating this prediction and sampling, we eventually generate a realistic image $x_0$.

Model training

To generate with a diffusion model, DDPM uses an autoencoder with a U-Net structure to predict the noise at time step $t$, i.e. $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$. The training objective is very simple:
$\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2$
Here $\epsilon$ is Gaussian noise. The noise prediction network takes the noised image as input and aims to predict the noise that was added; the objective asks the predicted noise to agree with the true noise. Finally, in DDPM the mean $\mu_{\theta}$ is defined as:
$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Big)$
In DDPM, the variance term $\Sigma_{\theta}$ of the Gaussian in the reverse process is a constant; later work uses a separate network branch to predict the variance term, for better generation quality.

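Summary

1. A Diffusion Model is represented, in parameterized form, as a Markov chain, which means the latent variables $x_1,...,x_T$ at the current time step $t$ depend only on the previous time step $t - 1$; this is very helpful for the subsequent computation.
2. The transition distributions in the Markov chain are Gaussian: in the forward diffusion process the parameters of the Gaussians are set directly, whereas in the reverse process the Gaussian parameters of $p_{\theta}(x_{t-1}|x_t)$ are learned.
3. The Diffusion Model network is fairly extensible and robust: any network whose input and output have the same dimensions can be chosen, for example a UNet-like architecture, keeping the input and output tensor dims equal.
4. The goal of a Diffusion Model is maximum likelihood on the input data; in practice, training adjusts the model parameters to minimize a variational upper bound on the negative log-likelihood of the data.
5. In transforming the probability distributions, thanks to the Markov assumption, the variational upper bound in point 4 can be expressed through KL divergences, which avoids having to estimate it by Monte Carlo sampling.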

References

1. https://zhuanlan.zhihu.com/p/549623622
2. https://zhuanlan.zhihu.com/p/449284962
3. https://zhuanlan.zhihu.com/p/532736667
4. https://zhuanlan.zhihu.com/p/525106459
5. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
6. Jonathan Ho et al. "Denoising Diffusion Probabilistic Models." arXiv preprint arXiv:2006.11239 (2020).
7. Prafulla Dhariwal & Alex Nichol. "Diffusion Models Beat GANs on Image Synthesis." arXiv preprint arXiv:2105.05233 (2021).