Pattern Recognition: The EM Algorithm
Preliminaries: Probability and Likelihood
Consider a distribution $p(x;\theta)$ parameterized by $\theta$. If the distribution, i.e. $\theta$, is known, it describes how a random sample behaves, for example through its expectation $E[x;\theta]$. If the distribution is unknown and samples $\{x_n\}$ are drawn from it, the parameter can instead be estimated by maximum likelihood: $\theta = \arg\max_\theta\ \ln L(\theta \mid \{x_n\})$. In short, probability reasons from parameters to data, while likelihood reasons from data to parameters.
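As a concrete illustration, here is a minimal numerical sketch of maximum likelihood estimation, assuming a unit-variance Gaussian whose mean plays the role of $\theta$ (the setup and all names here are our own, not from the text):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

# Draw samples from a Gaussian whose mean (3.0) is unknown to the estimator.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=1000)

# theta_hat = argmax_theta ln L(theta | {x_n}), found numerically.
neg_log_lik = lambda theta: -norm.logpdf(samples, loc=theta, scale=1.0).sum()
theta_hat = minimize_scalar(neg_log_lik).x
print(theta_hat)  # close to 3.0 and to samples.mean(), the closed-form MLE
```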
1. Density Estimation
Data of practical interest may not follow any standard probability distribution; density estimation algorithms are used to estimate its probability density. In general, a non-standard distribution can be fitted with a combination of several Gaussian functions, which is called a Gaussian Mixture Model (GMM).
Consider an unobserved random variable $\bm{z}$, which may be hidden or simply unknown. Consider the joint distribution of $\bm{x}^{(i)}, \bm{z}^{(i)}$:

$$p(\bm{x}^{(i)}, \bm{z}^{(i)}) = p(\bm{x}^{(i)}\mid\bm{z}^{(i)})\,p(\bm{z}^{(i)})$$

where $\bm{z}^{(i)} \sim \mathrm{Multinomial}(\phi)$ gives the probability of each of the $k$ Gaussians, and $\bm{x}^{(i)}\mid \bm{z}^{(i)} = j \sim N(\bm\mu_j, \bm\Sigma_j)$ is the known data distribution of the $j$-th Gaussian.
If $\bm{z}^{(i)}$ were known, maximum likelihood estimation could be used:

$$\ln L(\phi, \bm\mu, \bm\Sigma) = \sum_{i=1}^m \log p(\bm{x}^{(i)}, \bm{z}^{(i)};\phi, \bm\mu, \bm\Sigma)$$

which yields

$$\phi_j = \frac{1}{m}\sum_{i=1}^m 1\{z^{(i)} = j\},\qquad \bm\mu_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\}\,\bm{x}^{(i)}}{\sum_{i=1}^m 1\{z^{(i)} = j\}}$$

(and analogously for $\bm\Sigma_j$). This is the setting of Gaussian Discriminant Analysis (GDA). However, $\bm{z}^{(i)}$ is unknown. Instead, we can use the model to guess the values of $\bm{z}^{(i)}$, fit better parameter values by maximum likelihood, guess $\bm{z}^{(i)}$ again, and iterate. This algorithm is called the Expectation-Maximization (EM) algorithm.
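When the labels are observed, these closed-form estimates are a few lines of NumPy. A sketch (the function name `complete_data_mle` and the array layout are our own choices):

```python
import numpy as np

def complete_data_mle(X, z, k):
    """Complete-data MLE for a GMM when the component labels z are observed.

    X: (m, d) data matrix; z: (m,) integer labels in {0, ..., k-1}.
    Returns mixing weights phi, means mu, covariances Sigma.
    """
    m, d = X.shape
    phi = np.zeros(k)
    mu = np.zeros((k, d))
    Sigma = np.zeros((k, d, d))
    for j in range(k):
        mask = (z == j)                        # indicator 1{z^(i) = j}
        phi[j] = mask.mean()                   # phi_j = sum_i 1{z=j} / m
        mu[j] = X[mask].mean(axis=0)           # mu_j = mean of cluster j
        diff = X[mask] - mu[j]
        Sigma[j] = diff.T @ diff / mask.sum()  # Sigma_j = cluster covariance
    return phi, mu, Sigma
```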
2. The EM Algorithm
The EM algorithm proceeds as follows:
(1) Guess the values of the unknown $\bm{z}^{(i)}$; this is called the E-step.
(2) Estimate the parameter values by maximum likelihood; this is called the M-step.
(3) Iterate (1)-(2) until convergence.
2.1 The E-step
The E-step guesses the values of the unknown $\bm{z}^{(i)}$:

$$w_j^{(i)} = p(\bm{z}^{(i)} = j \mid \bm{x}^{(i)}; \phi, \bm\mu, \bm\Sigma)$$

This step computes the probability that $\bm{z}^{(i)} = j$; that is, $w_j^{(i)}$ is the probability that $\bm{x}^{(i)}$ was generated by the $j$-th Gaussian, relative to all the Gaussians. By Bayes' rule,

$$p(B_i\mid A) = \frac{P(B_i)\,P(A\mid B_i)}{\sum_{j=1}^n P(B_j)\,P(A\mid B_j)}$$

so

$$w_j^{(i)} = \frac{p(\bm{x}^{(i)}\mid\bm{z}^{(i)} = j)\,p(\bm{z}^{(i)} = j)}{\sum_{l=1}^{k} p(\bm{x}^{(i)}\mid\bm{z}^{(i)} = l)\,p(\bm{z}^{(i)} = l)}$$
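As a concrete sketch, the E-step for a GMM can be written with `scipy.stats.multivariate_normal` (the function name `e_step` and the array layout are our own, not from the text):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    """E-step: responsibilities w[i, j] = p(z^(i)=j | x^(i); phi, mu, Sigma)."""
    m, k = X.shape[0], len(phi)
    w = np.zeros((m, k))
    for j in range(k):
        # Numerator of Bayes' rule: p(x | z=j) p(z=j).
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    w /= w.sum(axis=1, keepdims=True)  # normalize over components (denominator)
    return w
```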
2.2 The M-step
The M-step updates the parameters $\phi, \bm\mu, \bm\Sigma$:

$$\phi_j = \frac{1}{m}\sum_{i=1}^m w_j^{(i)},\qquad \bm\mu_j = \frac{\sum_{i=1}^m w_j^{(i)}\,\bm{x}^{(i)}}{\sum_{i=1}^m w_j^{(i)}},\qquad \bm\Sigma_j = \frac{\sum_{i=1}^m w_j^{(i)}\,(\bm{x}^{(i)} - \bm\mu_j)(\bm{x}^{(i)} - \bm\mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}$$

The only difference from the known-$\bm{z}^{(i)}$ case is that the indicator of whether the $i$-th point was generated by the $j$-th Gaussian is replaced by a probability; this corresponds precisely to the distinction between known and unknown $\bm{z}^{(i)}$.
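A matching M-step sketch, consuming the responsibilities `w` produced by the `e_step` helper above (again, names are our own):

```python
import numpy as np

def m_step(X, w):
    """M-step: re-estimate (phi, mu, Sigma) from responsibilities w (m, k)."""
    m, d = X.shape
    k = w.shape[1]
    Nj = w.sum(axis=0)                # effective counts: sum_i w_j^(i)
    phi = Nj / m                      # phi_j
    mu = (w.T @ X) / Nj[:, None]      # mu_j: responsibility-weighted means
    Sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu[j]
        # Sigma_j: responsibility-weighted covariance.
        Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j]
    return phi, mu, Sigma
```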
3. The General Form of the EM Algorithm
The GMM is just one special case of the EM algorithm; we now introduce the algorithm's general form.
3.1 Jensen's Inequality
First, Jensen's inequality. Consider a convex function $f$ and a random variable $\bm{x}$. Then

$$f(E[\bm{x}]) \le E[f(\bm{x})]$$

The intuition is that the secant line of a convex function lies above its graph. If $f$ is strictly convex, i.e. $d^2f/dx^2 > 0$, then equality holds if and only if $\bm{x} = E[\bm{x}]$, i.e. $\bm{x}$ takes some constant value with probability 1.
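A quick numerical sanity check of the inequality's direction, using the convex function $f(x)=x^2$ (our choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any non-degenerate distribution
f = lambda t: t**2                            # convex: f'' = 2 > 0
print(f(x.mean()), "<=", f(x).mean())         # f(E[x]) <= E[f(x)] holds
```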
3.2 The Lower-Bound View of Maximum Likelihood
Consider a model with probability distribution $p(\bm{x}, \bm{z};\bm\theta)$, of which only $\bm{x}$ can be observed, and try to maximize the likelihood

$$\begin{aligned}\ln L(\bm\theta) &= \sum_{i=1}^m \log p(\bm{x}^{(i)};\bm\theta)\\&= \sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)\end{aligned}$$

The EM algorithm is a maximum likelihood procedure, but rather than directly solving for a zero of the likelihood's derivative, it iteratively constructs a lower bound on the likelihood and maximizes that lower bound, thereby approaching the maximum of the likelihood. Since $\bm{z}$ is not observed, the likelihood depends on $\bm{x}$ only, so

$$\max_{\bm\theta}\ \ln L(\bm\theta) = \max_{\bm\theta}\ \sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)$$

Introduce an auxiliary function $Q_i(\bm{z}^{(i)})$, a probability distribution over the unknown random variable $\bm{z}^{(i)}$, satisfying

$$Q_i(\bm{z}^{(i)}) \ge 0,\qquad \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) = 1$$

Recalling the formula for expectation, $EX = \sum_{k=1}^m X_k p_k$, we have

$$\begin{aligned}\sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta) &= \sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)})\,\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})} \\&= \sum_{i=1}^m \log E\!\left[\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}\right]\end{aligned}$$

Here we sum, over all $\bm{z}^{(i)}$, the function value weighted by its probability, i.e. the expectation of that function under $Q_i$.
The logarithm $\log(x)$ is a concave function, so by Jensen's inequality (with the direction reversed),

$$\log(EX) \ge E[\log(X)]$$

Hence

$$\begin{aligned}\sum_{i=1}^m \log E\!\left[\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}\right] &\ge \sum_{i=1}^m E\!\left[\log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}\right] \\&= \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}\end{aligned}$$

that is,

$$\ln L(\bm\theta) \ge \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}$$

To make this lower bound tight, i.e. to make the equality hold, we need

$$\sum_{i=1}^m \log E\!\left[\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}\right] = \sum_{i=1}^m E\!\left[\log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}\right]$$

and this must hold for any data, so the distribution $Q$ must be chosen appropriately. By the equality condition of Jensen's inequality, we need

$$\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})} = \text{const}$$

i.e. the ratio takes the same value for every $\bm{z}^{(i)}$, so $Q_i(\bm{z}^{(i)}) \propto p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)$. Since also $\sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) = 1$, we obtain

$$\begin{aligned}Q_i(\bm{z}^{(i)}) &= \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{\sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)} \\&= \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{p(\bm{x}^{(i)};\bm\theta)} \\&= p(\bm{z}^{(i)}\mid\bm{x}^{(i)};\bm\theta)\end{aligned}$$
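To see the bound numerically for a GMM, here is a sketch (reusing the `e_step` helper above; `log_likelihood` and `lower_bound` are our own names, and `Q` is assumed elementwise positive): with $Q_i$ set to the posterior responsibilities the bound is tight, while any other valid $Q$ gives a strictly smaller value.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, phi, mu, Sigma):
    """ln L(theta) = sum_i log sum_j phi_j N(x^(i); mu_j, Sigma_j)."""
    pj = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                   for j in range(len(phi))], axis=1)
    return np.log(pj.sum(axis=1)).sum()

def lower_bound(X, Q, phi, mu, Sigma):
    """sum_i sum_z Q_i(z) log( p(x, z; theta) / Q_i(z) ) -- the bound above."""
    pj = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                   for j in range(len(phi))], axis=1)
    return (Q * (np.log(pj) - np.log(Q))).sum()

# With Q = e_step(X, phi, mu, Sigma), lower_bound(...) == log_likelihood(...);
# for any other valid Q the bound is strictly below the log-likelihood.
```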
3.3 The Theory Behind the EM Algorithm
In the E-step, guessing the values of $\bm{z}^{(i)}$ amounts to setting

$$Q_i(\bm{z}^{(i)}) = p(\bm{z}^{(i)}\mid\bm{x}^{(i)};\bm\theta)$$

which fixes a tight lower bound on the likelihood function at the current parameters $\bm\theta$.
In the M-step, updating the parameters amounts to

$$\bm\theta = \arg\max_{\bm\theta}\ \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}$$

that is, finding the maximum of the current tight lower bound on the likelihood, and thereby updating $\bm\theta$.
From a convex-optimization perspective, consider

$$J(\bm\theta, Q) = \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)};\bm\theta)}{Q_i(\bm{z}^{(i)})}$$

The E-step chooses $Q$ to maximize $J$, and the M-step chooses $\bm\theta$ to maximize $J$; this is a form of coordinate ascent.
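A minimal sketch of the resulting coordinate-ascent loop for a GMM, reusing the `e_step`, `m_step`, and `log_likelihood` helpers sketched above (the initialization choices are ours and assume at least 2-dimensional data):

```python
import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """Alternate maximizing J over Q (E-step) and over theta (M-step)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]       # means at random points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    history = []
    for _ in range(n_iter):
        w = e_step(X, phi, mu, Sigma)                  # maximize J over Q
        phi, mu, Sigma = m_step(X, w)                  # maximize J over theta
        history.append(log_likelihood(X, phi, mu, Sigma))
    return phi, mu, Sigma, history

# Since each step can only increase J, and the bound is tight after each
# E-step, ln L(theta) never decreases: history is non-decreasing.
```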