I. Sinusoidal Positional Encoding

1. Original formula

$$
\left\{\begin{aligned} PE(pos,2i) &= \sin\!\left(pos/10000^{2i/d_{\text{model}}}\right) \\ PE(pos,2i+1) &= \cos\!\left(pos/10000^{2i/d_{\text{model}}}\right) \end{aligned}\right. \tag{1}
$$

  1. Parameter meanings
    - $pos \in [0, n-1]$ is the token's absolute position in the sentence, where $n$ is the sequence length
    - $i \in [0, d_{\text{model}}/2 - 1]$ indexes the dimension pair
    - $\omega_{2i} = 1/10000^{2i/d_{\text{model}}}$ is the angular frequency of pair $i$ (a minimal sketch of Eq. (1) follows below)
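The PE matrix of Eq. (1) is easy to build directly. Below is a minimal NumPy sketch; the function name `sinusoidal_pe` and the shapes are illustrative assumptions, not from the original post:

```python
# Minimal sketch of Eq. (1): build the sinusoidal positional-encoding matrix.
import numpy as np

def sinusoidal_pe(n: int, d_model: int) -> np.ndarray:
    pos = np.arange(n)[:, None]                      # (n, 1) absolute positions
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2) dimension-pair index
    omega = 1.0 / 10000 ** (2 * i / d_model)         # angular frequencies w_{2i}
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(pos * omega)                # even dims: sin(pos * w_{2i})
    pe[:, 1::2] = np.cos(pos * omega)                # odd dims:  cos(pos * w_{2i})
    return pe
```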

2. Properties

  1. Property 1: relative positions are linearly expressible
    Take two positions $PE(t)$ and $PE(t+g)$. By the trigonometric angle-addition identities$^{\text{[1]}}$,
    $$
    \begin{aligned} PE(t+g,2i) &= \sin(t \cdot w_{2i} + g \cdot w_{2i}) \\ &= \sin(t \cdot w_{2i})\cos(g \cdot w_{2i}) + \cos(t \cdot w_{2i})\sin(g \cdot w_{2i}) \\ &= PE(t,2i) \cdot PE(g,2i+1) + PE(t,2i+1) \cdot PE(g,2i) \\ &= PE(t,2i) \cdot u + PE(t,2i+1) \cdot v \end{aligned} \tag{2}
    $$
    $$
    \begin{aligned} PE(t+g,2i+1) &= \cos(t \cdot w_{2i} + g \cdot w_{2i}) \\ &= \cos(t \cdot w_{2i})\cos(g \cdot w_{2i}) - \sin(t \cdot w_{2i})\sin(g \cdot w_{2i}) \\ &= PE(t,2i+1) \cdot PE(g,2i+1) - PE(t,2i) \cdot PE(g,2i) \\ &= PE(t,2i+1) \cdot u - PE(t,2i) \cdot v \end{aligned} \tag{3}
    $$
    Here $u$ and $v$ are constants that depend only on the relative distance $g$, which proves that $PE(t+g)$ can be expressed as a linear function of $PE(t)$.
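    A quick numerical sanity check of Eqs. (2)–(3); the values of $d_{\text{model}}$, $t$, $g$, $i$ below are illustrative assumptions, not from the post:

```python
# Check Eqs. (2)-(3): PE(t+g) is a linear combination of PE(t)'s sin/cos pair
# with coefficients u = cos(g*w), v = sin(g*w) that depend only on g.
import numpy as np

d_model, t, g, i = 64, 7, 5, 3                       # illustrative values
w = 1.0 / 10000 ** (2 * i / d_model)
u, v = np.cos(g * w), np.sin(g * w)

lhs_even = np.sin((t + g) * w)                       # PE(t+g, 2i)
rhs_even = np.sin(t * w) * u + np.cos(t * w) * v     # PE(t,2i)*u + PE(t,2i+1)*v
lhs_odd  = np.cos((t + g) * w)                       # PE(t+g, 2i+1)
rhs_odd  = np.cos(t * w) * u - np.sin(t * w) * v     # PE(t,2i+1)*u - PE(t,2i)*v

assert np.isclose(lhs_even, rhs_even) and np.isclose(lhs_odd, rhs_odd)
```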
  2. Property 2: starting from the self-attention formula, the PE–PE inner product depends only on the relative position $g$, and it is symmetric
    • An element $A_{t,t+g}$ of the attention-score matrix $QK^T$:
      $$
      \begin{aligned} A_{t,t+g} &= \text{Query}_{t} \cdot \text{Key}_{t+g} = \sum_{m=1}^{d} \text{Query}_{t,m} \times \text{Key}_{t+g,m} \\ &= \sum_{m=1}^{d} \left[(E(t,m) + PE(t,m)) W_m^Q\right] \times \left[(E(t+g,m) + PE(t+g,m)) W_m^K\right] \end{aligned} \tag{4}
      $$
    • Consider the purely position-dependent inner-product term$^{\text{[1]}}$ (a numerical check follows after the conclusions below):
      $$
      \begin{aligned} PE(t) \cdot PE(t + g) &= \sum_{i=0}^{d/2-1} PE(t, 2i) \cdot PE(t + g, 2i) + \sum_{i=0}^{d/2-1} PE(t, 2i + 1) \cdot PE(t + g, 2i + 1) \\ &= \sum_{i=0}^{d/2-1} \sin (t \cdot w_{2i}) \cdot \sin [(t + g) \cdot w_{2i}] + \sum_{i=0}^{d/2-1} \cos (t \cdot w_{2i}) \cdot \cos [(t + g) \cdot w_{2i}] \\ &= \sum_{i=0}^{d/2-1} \cos (g \cdot w_{2i}) \end{aligned} \tag{5}
      $$
    • Conclusions:
      • The position-dependent inner-product term is a constant determined solely by $g$. It follows at once that $PE(t+g) \cdot PE(t) = PE(t) \cdot PE(t-g)$, i.e. the sinusoidal encoding is symmetric in the relative offset$^{\text{[1]}}$
      • $PE(\cdot)$ does influence the learnable parameters, namely the projection matrices $W$
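As indicated above, a small numerical check of Eq. (5) and of the symmetry claim; the dimensions and positions are illustrative assumptions:

```python
# Check Eq. (5): PE(t) . PE(t+g) collapses to sum_i cos(g * w_{2i}),
# independent of the absolute position t, and is symmetric in the offset g.
import numpy as np

d_model, t, g = 64, 11, 4                            # illustrative values
i = np.arange(d_model // 2)
w = 1.0 / 10000 ** (2 * i / d_model)

def pe(pos):
    out = np.empty(d_model)
    out[0::2], out[1::2] = np.sin(pos * w), np.cos(pos * w)
    return out

assert np.isclose(pe(t) @ pe(t + g), np.cos(g * w).sum())   # Eq. (5)
assert np.isclose(pe(t) @ pe(t + g), pe(t) @ pe(t - g))     # symmetry
```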

3. Comparison with other approaches

  1. vs. a simple linear encoding: with $PE(pos) = pos$ we get $PE(pos+k) = pos + k = PE(pos) + k$. Drawbacks: the unbounded range makes normalization unstable, and there is no coupling across dimensions. The sinusoidal functions are bounded in $[-1, 1]$ and therefore stable.
  2. vs. learned positional embeddings: *Attention Is All You Need* also tried learned positional embeddings and found the two approaches perform essentially the same; the sinusoidal version was chosen because it may allow the model to extrapolate to sequence lengths longer than those seen during training. (The author's conjecture: the two perform similarly because, although $PE$ is fixed, it still participates in shaping $W$ during training.)
  3. In practice, however, the two positional terms $PE$ are never multiplied directly; they are mediated by the projection matrices $W$, so the goal of making the score depend only on the content vectors and the relative position is not actually achieved.

II. Rotary Position Embedding (RoPE)$^{[2]}$

1. Goal

  1. The similarity computation should depend only on the relative distance between the two positions, not on their absolute positions
  2. Desired form (a function of three arguments): $Q_iK^T_j=g(X_i,X_j,i-j)$

2. Background (rotation matrices)

  1. 2D rotation matrix
    • Definition: $R(\theta)=\begin{pmatrix} \cos \theta & \sin \theta \\ -\sin \theta & \cos \theta \end{pmatrix}$
    • Geometric meaning of $XR(\theta)$: rotating the row vector $X$ counterclockwise by $\theta$
    • Properties (numerically checked in the sketch below):
      • $R(\theta)^T=R(-\theta)$
      • $R(\theta_1)R(\theta_2)=R(\theta_1+\theta_2)$
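    A minimal numerical check of these two properties, using arbitrary illustrative angles:

```python
# Check: R(a)^T = R(-a) and R(a) @ R(b) = R(a + b) for the 2D rotation matrix.
import numpy as np

def R(theta: float) -> np.ndarray:
    return np.array([[np.cos(theta),  np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

a, b = 0.3, 1.1                                      # illustrative angles
assert np.allclose(R(a).T, R(-a))
assert np.allclose(R(a) @ R(b), R(a + b))
```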
  2. High-dimensional rotation matrix
    • Definition: assuming the space has an even number of dimensions, split it into independent, mutually orthogonal 2D subspaces and rotate each of them independently

    • Formulas:

      • Base rotation-angle sequence: $\Theta=(\theta_1,\theta_2,\ldots,\theta_{D/2})$
      • $$R(\Theta)=\begin{pmatrix}R(\theta_{1})&0&\cdots&0\\0&R(\theta_{2})&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&R(\theta_{D/2})\end{pmatrix}$$
      • Each subspace is rotated separately (shown here for $D=4$, i.e. two 2D blocks): $$R(\Theta)=\begin{pmatrix}R(\theta_{1})&0\\0&R(\theta_{2})\end{pmatrix}=\begin{pmatrix}R(\theta_{1})&0\\0&I\end{pmatrix}\begin{pmatrix}I&0\\0&R(\theta_{2})\end{pmatrix}=\widehat{R}(\theta_{1})\,\widehat{R}(\theta_{2})$$
    • Geometric meaning: each independent 2D subspace is rotated by its own angle (a sketch follows below)
      $$XR(\Theta)=(X^1,X^2)\begin{pmatrix}R(\theta_1)&0\\0&R(\theta_2)\end{pmatrix}=(X^1R(\theta_1),\;X^2R(\theta_2))$$
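A minimal sketch, assuming illustrative dimensions and random angles, showing that applying the block-diagonal $R(\Theta)$ to a row vector equals rotating each 2D pair independently:

```python
# Block-diagonal rotation R(Theta) vs. independent per-pair 2D rotations.
import numpy as np

def R(theta):
    return np.array([[np.cos(theta),  np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

rng = np.random.default_rng(0)
D = 8                                                # illustrative dimension
thetas = rng.random(D // 2)                          # Theta = (theta_1, ..., theta_{D/2})
R_big = np.zeros((D, D))
for k, th in enumerate(thetas):                      # place R(theta_k) on the diagonal
    R_big[2*k:2*k+2, 2*k:2*k+2] = R(th)

x = rng.standard_normal(D)
full = x @ R_big                                     # X R(Theta)
pairwise = np.concatenate([x[2*k:2*k+2] @ R(th) for k, th in enumerate(thetas)])
assert np.allclose(full, pairwise)
```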

3. Rotary position encoding

  1. Motivation: rotate the two vectors by angles proportional to their positions $i$ and $j$ before taking the dot product, so that the inner product of the rotated vectors carries position information; the norms are unchanged and the angle between them changes by $(j-i)\theta$
  2. A solution in the 2D case
    • Derivation:
      $$
      \begin{aligned}Q_i&=X_iW_QR(i\theta)\\K_{j}&=X_jW_KR(j\theta)\\Q_iK_j^T&=X_iW_QR(i\theta)R(j\theta)^TW_K^TX_j^T\\&=X_iW_QR(i\theta)R(-j\theta)W_K^TX_j^T\\&=X_iW_QR((i-j)\theta)W_K^TX_j^T\\&=g(X_i,X_j,i-j)\end{aligned}
      $$
    • Note: the rotation is applied after the projection so that the two $R$ factors can be merged (a numerical check follows below)
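    A minimal 2D check with random vectors and projections and an illustrative $\theta$ (all values assumed, not from the post), showing that the score depends only on the offset $i-j$:

```python
# 2D RoPE check: after rotating the projected vectors by i*theta and j*theta,
# the score Q_i K_j^T depends only on i - j.
import numpy as np

def R(theta):
    return np.array([[np.cos(theta),  np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

rng = np.random.default_rng(0)
x_q, x_k = rng.standard_normal(2), rng.standard_normal(2)
W_Q, W_K = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
theta = 0.37                                         # illustrative angle

def score(i, j):
    q = x_q @ W_Q @ R(i * theta)                     # Q_i = X_i W_Q R(i*theta)
    k = x_k @ W_K @ R(j * theta)                     # K_j = X_j W_K R(j*theta)
    return q @ k

assert np.isclose(score(5, 2), score(13, 10))        # same offset i - j = 3
```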
  3. The solution in high dimensions
    • Derivation:
      $$
      \begin{aligned}Q_i&=X_iW_QR(i\Theta)\\K_{j}&=X_jW_KR(j\Theta)\\Q_iK_j^T&=X_iW_QR(i\Theta)R(j\Theta)^TW_K^TX_j^T\\&=X_iW_QR(i\Theta)R(-j\Theta)W_K^TX_j^T\\&=X_iW_QR((i-j)\Theta)W_K^TX_j^T\\&=g(X_i,X_j,i-j)\end{aligned}
      $$
    • Where:
      • The rotation-angle sequence at position $i$: $i\Theta=(i\theta_1,i\theta_2,\ldots,i\theta_d)$, with $\theta_k=10000^{-k/d}$, $k\in\{1,2,\ldots,d\}$ (here $d$ denotes the number of 2D subspaces, i.e. $d=D/2$)
      • Derivation:
        $$
        \begin{aligned}R(i\Theta)R(j\Theta)^T&=\widehat{R}(i\theta_1)\widehat{R}(i\theta_2)\cdots\widehat{R}(i\theta_d)\,\widehat{R}(j\theta_d)^T\cdots\widehat{R}(j\theta_2)^T\widehat{R}(j\theta_1)^T\\&=\left(\widehat{R}(i\theta_1)\widehat{R}(j\theta_1)^T\right)\left(\widehat{R}(i\theta_2)\widehat{R}(j\theta_2)^T\right)\cdots\left(\widehat{R}(i\theta_d)\widehat{R}(j\theta_d)^T\right)\\&=\widehat{R}((i-j)\theta_1)\widehat{R}((i-j)\theta_2)\cdots\widehat{R}((i-j)\theta_d)\\&=R((i-j)\Theta)\end{aligned}
        $$
    • PS: the regrouping above uses associativity (which holds for any matrices) and commutativity (which requires extra conditions, e.g. symmetric or diagonal matrices$^{[4]}$; the $\widehat{R}$ factors here act on disjoint 2D subspaces and therefore commute). A full-dimensional numerical check follows below.
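The promised check: a minimal sketch with random data and assumed dimensions, verifying that with block-diagonal rotations $R(i\Theta)$ the score $Q_iK_j^T$ depends only on $i-j$:

```python
# High-dimensional RoPE check with block-diagonal rotation matrices.
import numpy as np

def R_theta(pos, thetas):
    """Block-diagonal rotation matrix R(pos * Theta)."""
    D = 2 * len(thetas)
    M = np.zeros((D, D))
    for k, th in enumerate(pos * thetas):
        M[2*k:2*k+2, 2*k:2*k+2] = [[np.cos(th),  np.sin(th)],
                                   [-np.sin(th), np.cos(th)]]
    return M

rng = np.random.default_rng(0)
D = 8                                                        # illustrative dimension
thetas = 10000.0 ** (-np.arange(1, D // 2 + 1) / (D // 2))   # theta_k = 10000^{-k/d}, d = D/2
x_q, x_k = rng.standard_normal(D), rng.standard_normal(D)
W_Q, W_K = rng.standard_normal((D, D)), rng.standard_normal((D, D))

def score(i, j):
    q = x_q @ W_Q @ R_theta(i, thetas)               # Q_i = X_i W_Q R(i*Theta)
    k = x_k @ W_K @ R_theta(j, thetas)               # K_j = X_j W_K R(j*Theta)
    return q @ k

assert np.isclose(score(9, 4), score(25, 20))        # both offsets equal 5
```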
  4. The overall picture
    • Distinguishability
      • Claim: as the position grows, the rotation angles never repeat
      • Proof sketch: suppose there were positions $i, j$ with $j\theta_k-i\theta_k=2m\pi$ for some integer $m$; then $\theta_k=2m\pi/(j-i)$, which could only hold if $\theta_k$ contained a factor of the irrational number $\pi$. Since $\theta_k=10000^{-k/d}$ does not, no such pair of positions exists
    • Another possible advantage: the position encoding is applied inside every block of the forward pass, so position information is less easily lost than with sinusoidal encoding, which is added only once before the first block

4. Comparison with sinusoidal positional encoding

  1. Sinusoidal: because the encoding is added to $X$, the changes in both the norm and the angle of the resulting vector are complicated
  2. Rotation-matrix (RoPE): the norm is preserved and the angle changes by $(j-i)\theta$, which is predictable and more stable (see the comparison sketch below)
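A minimal comparison sketch (random vector, assumed dimensions) illustrating the contrast: the RoPE rotation preserves the norm, while adding a sinusoidal PE vector generally does not:

```python
# Norm behavior: rotation (RoPE) vs. additive sinusoidal PE.
import numpy as np

D, pos = 8, 5                                        # illustrative dimension and position
rng = np.random.default_rng(0)
x = rng.standard_normal(D)

i = np.arange(D // 2)
w = 1.0 / 10000 ** (2 * i / D)
pe = np.empty(D)
pe[0::2], pe[1::2] = np.sin(pos * w), np.cos(pos * w)

R = np.zeros((D, D))
for k, th in enumerate(pos * w):                     # block-diagonal rotation R(pos*Theta)
    R[2*k:2*k+2, 2*k:2*k+2] = [[np.cos(th),  np.sin(th)],
                               [-np.sin(th), np.cos(th)]]

print(np.linalg.norm(x), np.linalg.norm(x @ R))      # equal: rotation preserves the norm
print(np.linalg.norm(x + pe))                        # generally different: addition does not
```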

References

  1. 再论大模型位置编码及其外推性 (Revisiting positional encodings in large models and their extrapolation)
  2. 解密旋转位置编码:数学基础、代码实现与绝对编码一体化探索 (Demystifying rotary position embedding: mathematical foundations, code implementation, and unification with absolute encoding)
  3. 终于知道Transformer 为啥离不开 RoPE了 (Why Transformers can't do without RoPE)
  4. 矩阵乘法可交换的充要条件是? (What are the necessary and sufficient conditions for matrix multiplication to commute?)