Principles of Deep Learning Algorithms: Softmax Regression
1. A Brief Introduction to Logistic Regression
Logistic regression is a classification algorithm for binary classification problems. Suppose there are $m$ training samples $\left\{\left(\mathbf{x}^{(1)},y^{(1)}\right),\left(\mathbf{x}^{(2)},y^{(2)}\right),\cdots,\left(\mathbf{x}^{(m)},y^{(m)}\right)\right\}$. For logistic regression, the input features are $\mathbf{x}^{(i)}\in\Re^{n+1}$, the class labels are $y^{(i)}\in\left\{0,1\right\}$, and the hypothesis function is the Sigmoid function:
$$h_\theta\left(\mathbf{x}\right)=\frac{1}{1+e^{-\theta^T\mathbf{x}}}$$
where $\theta$ is the model parameter, obtained by minimizing the loss function. The loss function of the model is:
$$J\left(\theta\right)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta\left(\mathbf{x}^{(i)}\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta\left(\mathbf{x}^{(i)}\right)\right)\right]$$
The loss can be minimized with gradient descent; its gradient is:
$$\begin{aligned}
\triangledown_{\theta_j}J\left(\theta\right)&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_\theta\left(\mathbf{x}^{(i)}\right)}\cdot\triangledown_{\theta_j}h_\theta\left(\mathbf{x}^{(i)}\right)+\frac{1-y^{(i)}}{1-h_\theta\left(\mathbf{x}^{(i)}\right)}\cdot\triangledown_{\theta_j}\left(1-h_\theta\left(\mathbf{x}^{(i)}\right)\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_\theta\left(\mathbf{x}^{(i)}\right)}\cdot\triangledown_{\theta_j}h_\theta\left(\mathbf{x}^{(i)}\right)-\frac{1-y^{(i)}}{1-h_\theta\left(\mathbf{x}^{(i)}\right)}\cdot\triangledown_{\theta_j}h_\theta\left(\mathbf{x}^{(i)}\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\left(\frac{y^{(i)}}{h_\theta\left(\mathbf{x}^{(i)}\right)}-\frac{1-y^{(i)}}{1-h_\theta\left(\mathbf{x}^{(i)}\right)}\right)\cdot\triangledown_{\theta_j}h_\theta\left(\mathbf{x}^{(i)}\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}-h_\theta\left(\mathbf{x}^{(i)}\right)}{h_\theta\left(\mathbf{x}^{(i)}\right)\left(1-h_\theta\left(\mathbf{x}^{(i)}\right)\right)}\cdot\triangledown_{\theta_j}h_\theta\left(\mathbf{x}^{(i)}\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}-h_\theta\left(\mathbf{x}^{(i)}\right)}{h_\theta\left(\mathbf{x}^{(i)}\right)\left(1-h_\theta\left(\mathbf{x}^{(i)}\right)\right)}\cdot\triangledown_{\theta^T\mathbf{x}^{(i)}}h_\theta\left(\mathbf{x}^{(i)}\right)\cdot\triangledown_{\theta_j}\left(\theta^T\mathbf{x}^{(i)}\right)\right]
\end{aligned}$$
where:
$$\triangledown_{\theta^T\mathbf{x}^{(i)}}h_\theta\left(\mathbf{x}^{(i)}\right)=h_\theta\left(\mathbf{x}^{(i)}\right)\left(1-h_\theta\left(\mathbf{x}^{(i)}\right)\right)$$
$$\triangledown_{\theta_j}\left(\theta^T\mathbf{x}^{(i)}\right)=x^{(i)}_j$$
Therefore, the gradient is:
$$\triangledown_{\theta_j}J\left(\theta\right)=-\frac{1}{m}\sum_{i=1}^{m}\left[\left(y^{(i)}-h_\theta\left(\mathbf{x}^{(i)}\right)\right)\cdot x^{(i)}_j\right]$$
Gradient descent then gives the following update rule:
$$\theta_j:=\theta_j-\alpha\triangledown_{\theta_j}J\left(\theta\right)$$
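To make the update concrete, below is a minimal NumPy sketch of logistic regression trained with this rule. The function names, the bias handling (a leading column of ones in `X`), and the learning-rate and iteration defaults are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + exp(-theta^T x))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, iters=1000):
    """X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels in {0, 1}."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        h = sigmoid(X @ theta)           # predictions h_theta(x^(i))
        grad = -(X.T @ (y - h)) / m      # gradient of J(theta) derived above
        theta -= alpha * grad            # theta_j := theta_j - alpha * grad_j
    return theta
```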
2. Softmax Regression
2.1. A Brief Introduction to Softmax Regression
Softmax regression is the generalization of logistic regression to multi-class problems, i.e. the class label $y$ can take two or more values. Suppose there are $m$ training samples $\left\{\left(\mathbf{x}^{(1)},y^{(1)}\right),\left(\mathbf{x}^{(2)},y^{(2)}\right),\cdots,\left(\mathbf{x}^{(m)},y^{(m)}\right)\right\}$. For softmax regression, the input features are $\mathbf{x}^{(i)}\in\Re^{n+1}$ and the class labels are $y^{(i)}\in\left\{1,2,\cdots,k\right\}$. The hypothesis function estimates, for each sample, the probability $p\left(y=j\mid\mathbf{x}\right)$ of belonging to each class; concretely, the hypothesis function is:
$$h_\theta\left(\mathbf{x}^{(i)}\right)=\begin{bmatrix}p\left(y^{(i)}=1\mid\mathbf{x}^{(i)};\theta\right)\\p\left(y^{(i)}=2\mid\mathbf{x}^{(i)};\theta\right)\\\vdots\\p\left(y^{(i)}=k\mid\mathbf{x}^{(i)};\theta\right)\end{bmatrix}=\frac{1}{\sum_{j=1}^{k}e^{\theta^T_j\mathbf{x}^{(i)}}}\begin{bmatrix}e^{\theta^T_1\mathbf{x}^{(i)}}\\e^{\theta^T_2\mathbf{x}^{(i)}}\\\vdots\\e^{\theta^T_k\mathbf{x}^{(i)}}\end{bmatrix}$$
where $\theta_1,\theta_2,\cdots,\theta_k$ are the parameter vectors of the model, with $\theta_i\in\Re^{n+1}$. The estimated probability that sample $i$ belongs to class $j$ is then:
$$p\left(y^{(i)}=j\mid\mathbf{x}^{(i)};\theta\right)=\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}$$
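A minimal NumPy sketch of this hypothesis function might look as follows; subtracting the maximum score before exponentiating is an added numerical-stability detail, not part of the formula itself.

```python
import numpy as np

def softmax_probs(Theta, x):
    """Theta: (k, n+1) matrix whose rows are the theta_j; x: (n+1,) feature vector.
    Returns the k class probabilities p(y = j | x; theta)."""
    scores = Theta @ x                    # theta_j^T x for every class j
    scores -= scores.max()                # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize so the probabilities sum to 1
```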
2.2. The Cost Function of Softmax Regression
Analogously to logistic regression, the softmax cost function uses an indicator function $I\left\{\cdot\right\}$, defined as:
$$I\left\{expression\right\}=\begin{cases}0&\text{ if }expression=false\\1&\text{ if }expression=true\end{cases}$$
The cost function of softmax regression is then:
$$J\left(\theta\right)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}I\left\{y^{(i)}=j\right\}\log\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\right]$$
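Assuming the labels are stored as integers in $\{1,\cdots,k\}$, a possible vectorized sketch of this cost is shown below; the indicator $I\{y^{(i)}=j\}$ simply picks out the log-probability of each sample's true class.

```python
import numpy as np

def softmax_cost(Theta, X, y, k):
    """Theta: (k, n+1); X: (m, n+1); y: (m,) integer labels in {1, ..., k}."""
    m = X.shape[0]
    scores = X @ Theta.T                          # (m, k) matrix of theta_j^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # I{y^(i) = j} selects the log-probability of the true class of each sample
    return -log_probs[np.arange(m), y - 1].mean()
```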
2.3. Solving Softmax Regression
The cost function above can be minimized with gradient descent. First, take its gradient:
$$\triangledown_{\theta_j}J\left(\theta\right)=-\frac{1}{m}\sum_{i=1}^{m}\left[\triangledown_{\theta_j}\sum_{j=1}^{k}I\left\{y^{(i)}=j\right\}\log\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\right]$$
Since each sample belongs to exactly one class, there are two cases:
- If $y^{(i)}=j$, then $I\left\{y^{(i)}=j\right\}=1$:
$$\begin{aligned}
\triangledown_{\theta_j}J\left(\theta\right)&=-\frac{1}{m}\sum_{i=1}^{m}\left[\triangledown_{\theta_j}\log\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}{e^{\theta^T_j\mathbf{x}^{(i)}}}\cdot\frac{e^{\theta^T_j\mathbf{x}^{(i)}}\cdot\mathbf{x}^{(i)}\cdot\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}-e^{\theta^T_j\mathbf{x}^{(i)}}\cdot\mathbf{x}^{(i)}\cdot e^{\theta^T_j\mathbf{x}^{(i)}}}{\left(\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}\right)^2}\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}-e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\cdot\mathbf{x}^{(i)}\right]
\end{aligned}$$
- If $y^{(i)}\neq j$, say $y^{(i)}={j}'$, then $I\left\{y^{(i)}=j\right\}=0$ and $I\left\{y^{(i)}={j}'\right\}=1$:
$$\begin{aligned}
\triangledown_{\theta_j}J\left(\theta\right)&=-\frac{1}{m}\sum_{i=1}^{m}\left[\triangledown_{\theta_j}\log\frac{e^{\theta^T_{{j}'}\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}{e^{\theta^T_{{j}'}\mathbf{x}^{(i)}}}\cdot\frac{-e^{\theta^T_{{j}'}\mathbf{x}^{(i)}}\cdot\mathbf{x}^{(i)}\cdot e^{\theta^T_j\mathbf{x}^{(i)}}}{\left(\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}\right)^2}\right]\\
&=-\frac{1}{m}\sum_{i=1}^{m}\left[-\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\cdot\mathbf{x}^{(i)}\right]
\end{aligned}$$
Combining the two cases, the final result is:
$$-\frac{1}{m}\sum_{i=1}^{m}\left[\mathbf{x}^{(i)}\left(I\left\{y^{(i)}=j\right\}-p\left(y^{(i)}=j\mid\mathbf{x}^{(i)};\theta\right)\right)\right]$$
Note that $\theta_j$ here denotes a vector. The parameters can then be updated with the gradient descent rule:
$$\theta_j:=\theta_j-\alpha\triangledown_{\theta_j}J\left(\theta\right)$$
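Putting the gradient formula and the update rule together gives the following minimal (unregularized) training sketch. The one-hot matrix is just a vectorized encoding of the indicator $I\{y^{(i)}=j\}$, and the hyperparameter defaults are arbitrary illustrative choices.

```python
import numpy as np

def softmax_train(X, y, k, alpha=0.1, iters=1000):
    """X: (m, n+1) with a leading column of ones; y: (m,) integer labels in {1, ..., k}."""
    m, n_plus_1 = X.shape
    Theta = np.zeros((k, n_plus_1))
    Y = np.eye(k)[y - 1]                              # one-hot matrix, Y[i, j] = I{y^(i) = j+1}
    for _ in range(iters):
        scores = X @ Theta.T
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)             # p(y^(i) = j | x^(i); theta)
        grad = -((Y - P).T @ X) / m                   # gradient derived above, one row per theta_j
        Theta -= alpha * grad                         # theta_j := theta_j - alpha * grad_j
    return Theta
```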
2.4. Properties of the Parameters in Softmax Regression
Softmax regression suffers from parameter redundancy: loosely speaking, some of the parameters are superfluous. To show this, subtract a vector $\psi$ from every parameter vector $\theta_j$; the hypothesis function becomes:
$$\begin{aligned}
p\left(y^{(i)}=j\mid\mathbf{x}^{(i)};\theta\right)&=\frac{e^{(\theta_j-\psi)^T\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{(\theta_l-\psi)^T\mathbf{x}^{(i)}}}\\
&=\frac{e^{\theta^T_j\mathbf{x}^{(i)}}\cdot e^{-\psi^T\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}\cdot e^{-\psi^T\mathbf{x}^{(i)}}}\\
&=\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}
\end{aligned}$$
As the derivation shows, subtracting the vector $\psi$ from every parameter vector $\theta_j$ has no effect on the predictions; in other words, the model admits multiple sets of optimal parameters.
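This invariance is easy to verify numerically; the snippet below uses random parameters and an arbitrary $\psi$ purely as an illustration of the argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 5))   # k = 3 classes, n + 1 = 5 features
x = rng.normal(size=5)
psi = rng.normal(size=5)          # any vector psi

def probs(T, x):
    s = T @ x
    e = np.exp(s - s.max())
    return e / e.sum()

# Subtracting psi from every theta_j leaves the predicted probabilities unchanged.
print(np.allclose(probs(Theta, x), probs(Theta - psi, x)))   # True
```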
To keep the algorithm as simple as possible, all parameters are retained, and a weight-decay term is added to the cost function to resolve the parameter redundancy; weight decay is simply regularization of the parameters.
For example, with an L2 constraint on the parameters, the L2 penalty is:
$$\frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta^2_{ij}$$
The cost function then becomes:
$$J\left(\theta\right)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k}I\left\{y^{(i)}=j\right\}\log\frac{e^{\theta^T_j\mathbf{x}^{(i)}}}{\sum_{l=1}^{k}e^{\theta^T_l\mathbf{x}^{(i)}}}\right]+\frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta^2_{ij}$$
where $\lambda>0$; with this term the cost function becomes strictly convex.
Its gradient is:
$$\triangledown_{\theta_j}J\left(\theta\right)=-\frac{1}{m}\sum_{i=1}^{m}\left[\mathbf{x}^{(i)}\left(I\left\{y^{(i)}=j\right\}-p\left(y^{(i)}=j\mid\mathbf{x}^{(i)};\theta\right)\right)\right]+\lambda\theta_j$$
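Compared with the unregularized training sketch earlier, the only change is the extra $\lambda\theta_j$ term. A possible gradient routine (with `lam` standing in for $\lambda$, an illustrative name) is:

```python
import numpy as np

def softmax_grad_l2(Theta, X, Y, lam):
    """Gradient of the regularized cost.  Theta: (k, n+1); X: (m, n+1);
    Y: (m, k) one-hot matrix encoding I{y^(i) = j}; lam: the weight-decay lambda."""
    m = X.shape[0]
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # p(y^(i) = j | x^(i); theta)
    return -((Y - P).T @ X) / m + lam * Theta     # weight-decay term lambda * theta_j
```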
2.5. The Relationship Between Softmax and Logistic Regression
Logistic regression is a special case of softmax regression, namely the case $k=2$. When $k=2$, the softmax regression hypothesis is:
$$h_\theta\left(\mathbf{x}\right)=\frac{1}{e^{\theta_1^T\mathbf{x}}+e^{\theta_2^T\mathbf{x}}}\begin{bmatrix}e^{\theta_1^T\mathbf{x}}\\e^{\theta_2^T\mathbf{x}}\end{bmatrix}$$
Using the parameter-redundancy property of softmax regression, let $\psi=\theta_1$ and subtract it from both parameter vectors, which gives:
$$\begin{aligned}
h_\theta\left(\mathbf{x}\right)&=\frac{1}{e^{(\theta_1-\psi)^T\mathbf{x}}+e^{(\theta_2-\psi)^T\mathbf{x}}}\begin{bmatrix}e^{(\theta_1-\psi)^T\mathbf{x}}\\e^{(\theta_2-\psi)^T\mathbf{x}}\end{bmatrix}\\
&=\begin{bmatrix}\frac{1}{1+e^{(\theta_2-\theta_1)^T\mathbf{x}}}\\\frac{e^{(\theta_2-\theta_1)^T\mathbf{x}}}{1+e^{(\theta_2-\theta_1)^T\mathbf{x}}}\end{bmatrix}\\
&=\begin{bmatrix}\frac{1}{1+e^{(\theta_2-\theta_1)^T\mathbf{x}}}\\1-\frac{1}{1+e^{(\theta_2-\theta_1)^T\mathbf{x}}}\end{bmatrix}
\end{aligned}$$
This expression has exactly the same form as logistic regression.
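The equivalence can also be checked numerically: with $k=2$, the first softmax probability equals the sigmoid applied to $(\theta_1-\theta_2)^T\mathbf{x}$. The snippet below uses random vectors purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta1, theta2 = rng.normal(size=4), rng.normal(size=4)
x = rng.normal(size=4)

# Softmax with k = 2 classes.
scores = np.array([theta1 @ x, theta2 @ x])
p = np.exp(scores - scores.max())
p /= p.sum()

# Logistic regression with parameter vector theta1 - theta2.
sigmoid = 1.0 / (1.0 + np.exp(-(theta1 - theta2) @ x))

print(np.allclose(p[0], sigmoid))   # True: the two models agree
```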
2.6. Choosing Between a Multi-class Classifier and Multiple Binary Classifiers
One might expect that a multi-class problem can also be solved by combining several binary classifiers. Whether to use a single multi-class classifier or a stack of binary classifiers depends, as the UFLDL tutorial explains, on whether the classes are mutually exclusive:
- If the classes are mutually exclusive --> softmax regression
- If the classes are not mutually exclusive --> multiple independent logistic regression classifiers
For more on softmax regression, including experiments, see the blog post 简单易学的机器学习算法——Softmax Regression.
References
[1] UFLDL Tutorial (English)
[2] UFLDL教程 (Chinese translation)
[3] 《Python机器学习算法》 (Python Machine Learning Algorithms), Chapter 2: Softmax Regression