commit

jayarnim · Jan 27, 2025 · 68d1bef · 68d1bef
1 parent e8322ff
commit 68d1bef
Showing 2 changed files with 76 additions and 41 deletions.
diff --git a/_post_refer_img/TextAnalytics/09-05.png b/_post_refer_img/TextAnalytics/09-05.png
diff --git a/_posts/TextAnalytics/2024-08-06-ATTN.md b/_posts/TextAnalytics/2024-08-06-ATTN.md
@@ -24,12 +24,15 @@ image:
 
     $$\begin{aligned}
     \text{ATTN}\left(\mathcal{Q},\mathcal{K},\mathcal{V}\right)
-    = \underbrace{\text{Softmax}\left[f(\mathcal{Q},\mathcal{K})\right]}_{\text{Attention Weight}} \cdot \mathcal{V}
+    = \text{Softmax}\left[f(\mathcal{Q},\mathcal{K})\right] \cdot \mathcal{V}
     \end{aligned}$$
 
     - $\mathcal{Q}$ : the **Query**, the reference point for which information is sought from the input
     - $\mathcal{K}$ : the **Key**, the criterion matched against the query to assess relevance
     - $\mathcal{V}$ : the **Value**, the content returned according to that relevance
+    - $\mathcal{A}=f(\mathcal{Q},\mathcal{K})$ : the **Attention Score Matrix**, the matrix of similarities between queries and keys
+    - $\mathcal{W}=\text{Softmax}\left[\mathcal{A}\right]$ : the **Attention Weight Matrix** or **Attention Map**, the normalized attention scores
+    - $\mathcal{O}=\mathcal{W} \cdot \mathcal{V}$ : the attention output, the sum of the values weighted by the attention weights
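
The three quantities defined above ($\mathcal{A}$, $\mathcal{W}$, $\mathcal{O}$) can be traced in a minimal pure-Python sketch. It assumes the plain dot product for the score function $f$; the helper names `matmul`, `softmax_rows`, and `attention` are illustrative and do not come from the post.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    # Assumption: f(Q, K) is the plain dot product Q K^T.
    K_T = [list(col) for col in zip(*K)]
    A = matmul(Q, K_T)      # attention score matrix, shape M x N
    W = softmax_rows(A)     # attention weight matrix, each row sums to 1
    O = matmul(W, V)        # output, one weighted sum of values per query
    return A, W, O

Q = [[1.0, 0.0]]                    # one query (M = 1)
K = [[1.0, 0.0], [0.0, 1.0]]        # two keys (N = 2)
V = [[10.0, 0.0], [0.0, 10.0]]      # one value per key
A, W, O = attention(Q, K, V)
```

In this toy input the query matches the first key, so the first weight is about 0.73 and the output leans toward the first value.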
 
 - **`Example` Recommendation** (Latent Factor Model)
 
@@ -43,58 +46,88 @@ image:
     | $$\overrightarrow{\mathbf{k}}_{j} \in \mathbf{K}_{N \times D}$$ | Item profile vector | The criterion matched against the query to assess the relevance of each value to return |
 
     $$\begin{aligned}
-    \overrightarrow{\mathbf{w}}_{i}
+    \overrightarrow{\mathcal{o}}_{i}
     = \text{ATTN}\left(\overrightarrow{\mathbf{q}}_{i}, \mathbf{K}, \mathbf{V}\right)
     \in \mathbb{R}^{N}
     \end{aligned}$$
 
-- **Attention Score Function**
+## Attention Score Function
+-----
 
-    | Name | Function | Defined by |
-    |---|---|---|
-    | Dot Product | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \overrightarrow{\mathbf{q}} \cdot \mathbf{K}$$ | Luong et al. (2015) |
-    | Learnable Weighted Attention | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \overrightarrow{\mathbf{q}}^{T} \cdot \mathbf{W} \cdot \mathbf{K}$$ | Luong et al. (2015) |
-    | Additive Attention | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \mathbf{W}^{T}_{A} \cdot \text{tanh}\left[\mathbf{W}_{B} \cdot (\overrightarrow{\mathbf{q}} \oplus \mathbf{K})\right]$$ | Bahdanau et al. (2015) |
-    | Concatenation | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \mathbf{W}^{T}_{A} \cdot \text{tanh}\left[\mathbf{W}_{B} \cdot \overrightarrow{\mathbf{q}} + \mathbf{W}_{C} \cdot \mathbf{K}\right]$$ | Bahdanau et al. (2015) |
-    | Scaled Dot Product | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \displaystyle\frac{\overrightarrow{\mathbf{q}}^{T} \cdot \mathbf{K}}{\sqrt{n}}$$ | Vaswani et al. (2017) |
+| Name | Function | Defined by |
+|---|---|---|
+| Dot Product | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \overrightarrow{\mathbf{q}} \cdot \mathbf{K}$$ | Luong et al. (2015) |
+| Learnable Weighted Attention | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \overrightarrow{\mathbf{q}}^{T} \cdot \mathbf{W} \cdot \mathbf{K}$$ | Luong et al. (2015) |
+| Additive Attention | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \mathbf{W}^{T}_{A} \cdot \text{tanh}\left[\mathbf{W}_{B} \cdot (\overrightarrow{\mathbf{q}} \oplus \mathbf{K})\right]$$ | Bahdanau et al. (2015) |
+| Concatenation | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \mathbf{W}^{T}_{A} \cdot \text{tanh}\left[\mathbf{W}_{B} \cdot \overrightarrow{\mathbf{q}} + \mathbf{W}_{C} \cdot \mathbf{K}\right]$$ | Bahdanau et al. (2015) |
+| Scaled Dot Product | $$f(\overrightarrow{\mathbf{q}}, \mathbf{K}) = \displaystyle\frac{\overrightarrow{\mathbf{q}}^{T} \cdot \mathbf{K}}{\sqrt{n}}$$ | Vaswani et al. (2017) |
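
Three of the score functions in the table can be sketched as follows. This is an illustration under assumptions: keys are stored as the rows of `K`, the $n$ under the square root is read as the key dimension, and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def dot_product_score(q, K):
    # One score per key: q . k_j
    return [dot(q, k) for k in K]

def general_score(q, K, W):
    # Learnable weighted attention: q^T W k_j, with W of shape D x D.
    qW = [dot(q, col) for col in zip(*W)]
    return [dot(qW, k) for k in K]

def scaled_dot_product_score(q, K):
    # Assumption: n is taken to be the key dimension len(q).
    scale = math.sqrt(len(q))
    return [dot(q, k) / scale for k in K]

q = [1.0, 2.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

plain = dot_product_score(q, K)
general = general_score(q, K, identity)
scaled = scaled_dot_product_score(q, K)
```

With the identity matrix as `W`, the weighted score reduces to the plain dot product, which makes the relationship between the first two rows of the table easy to check.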
 
-## Multi-Head Attention
+## Adaptive Weight
 -----
 
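-- **Weighted Attention** : a technique that, to strengthen expressive power, linearly transforms $\mathcal{Q}, \mathcal{K}, \mathcal{V}$ with learnable weight matrices $\mathcal{W}$, computes similarity on the transformed inputs, and thereby learns the interactions between inputs dynamically
+- **Adaptive Weight** : a technique that, to strengthen expressive power, linearly transforms $\mathcal{Q}, \mathcal{K}, \mathcal{V}$ with learnable weight matrices $\mathbf{W}$, computes similarity on the transformed inputs, and thereby learns the interactions between inputs dynamically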
 
-    $$\begin{aligned}
-    \mathcal{Q}
-    &= \mathbf{Q}_{M \times D_{Q}} \cdot \mathcal{W}^{(Q)}_{D_{Q} \times D} &\in \mathbb{R}^{M \times D}\\
-    \mathcal{K}
-    &= \mathbf{K}_{N \times D_{K}} \cdot \mathcal{W}^{(K)}_{D_{K} \times D} &\in \mathbb{R}^{N \times D}\\
-    \mathcal{V}
-    &= \mathbf{V}_{N \times D_{V}} \cdot \mathcal{W}^{(V)}_{D_{V} \times D} &\in \mathbb{R}^{N \times D}
-    \end{aligned}$$
+    | WHAT | INPUT DATA | LEARNABLE WEIGHT | TOTAL |
+    |---|---|---|---|
+    | $$\mathcal{Q}$$ | $$\mathbf{Q} \in \mathbb{R}^{M \times D_{Q}}$$ | $$\mathbf{W}_{Q} \in \mathbb{R}^{D_{Q} \times D}$$ | $$\mathbf{Q} \cdot \mathbf{W}_{Q} \in \mathbb{R}^{M \times D}$$ |
+    | $$\mathcal{K}$$ | $$\mathbf{K} \in \mathbb{R}^{N \times D_{K}}$$ | $$\mathbf{W}_{K} \in \mathbb{R}^{D_{K} \times D}$$ | $$\mathbf{K} \cdot \mathbf{W}_{K} \in \mathbb{R}^{N \times D}$$ |
+    | $$\mathcal{V}$$ | $$\mathbf{V} \in \mathbb{R}^{N \times D_{V}}$$ | $$\mathbf{W}_{V} \in \mathbb{R}^{D_{V} \times D}$$ | $$\mathbf{V} \cdot \mathbf{W}_{V} \in \mathbb{R}^{N \times D}$$ |
+
+- **WHY? NECESSITY**
 
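     - When **dimension alignment** is needed among $\mathcal{Q}, \mathcal{K}, \mathcal{V}$
-        - When $\mathcal{Q}$ and $\mathcal{K}$ have different feature dimensions
+        - When $\mathcal{Q}$ and $\mathcal{K}$ have different feature dimensions ($D_{Q} \ne D_{K} \ne D_{V}$)
         - When $\mathcal{Q}$ and $\mathcal{K}$ share the same feature dimension, but the direction in which similarity contributes to importance (preference or non-preference) differs by context
+
     - When the aim is to capture **relationships more complex than similarity** in the interaction information
         - In self-attention, where the input and the returned values are the same but each role ($\mathcal{Q}, \mathcal{K}, \mathcal{V}$) must process the information differently
         - In multi-head attention, where each head should capture the interaction information from a different angle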
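
The dimension-alignment case can be made concrete with a small sketch. The weight matrices below are fixed by hand purely for illustration; in practice they would be learned, and the variable names are assumptions rather than the post's own.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(a):
    return (len(a), len(a[0]))

Q_in = [[1.0, 0.0, 2.0]]                        # M = 1, D_Q = 3
K_in = [[1.0, 2.0], [0.0, 1.0]]                 # N = 2, D_K = 2
W_Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # D_Q x D, with D = 2
W_K = [[1.0, 0.0], [0.0, 1.0]]                  # D_K x D

# Q_in and K_in cannot be compared directly (3 vs 2 features),
# but their projections live in the same D-dimensional space.
Q = matmul(Q_in, W_Q)
K = matmul(K_in, W_K)
scores = matmul(Q, [list(col) for col in zip(*K)])   # shape M x N
```

After projection both operands have $D = 2$ columns, so the $M \times N$ score matrix is well defined even though $D_{Q} \ne D_{K}$.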
 
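-- **Multi-Head Attention** : a technique that processes the input in parallel through several independent attention mechanisms (heads) to learn the relationships and patterns in the data from multiple angles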
+## Variations
+-----
+
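+- **Multi-Head Attention** : a technique that processes the input in parallel through several independent adaptive weighted attention mechanisms (heads) to learn the relationships and patterns in the data from multiple angles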
 
     ![03](/_post_refer_img/TextAnalytics/09-03.png){: width="100%"}
 
-    - Each head consists of an independent weighted attention
+    - Each head consists of an independent adaptive weighted attention
 
         $$\begin{aligned}
         \text{HEAD}^{(h)}
-        &= \text{ATTN}^{(h)}\left(\mathbf{Q} \cdot \mathcal{W}_{Q}^{(h)}, \mathbf{K} \cdot \mathcal{W}_{K}^{(h)}, \mathbf{V} \cdot \mathcal{W}_{V}^{(h)}\right)
+        &= \text{ATTN}^{(h)}\left(\mathbf{Q} \cdot \mathbf{W}_{\mathcal{Q}}^{(h)}, \mathbf{K} \cdot \mathbf{W}_{\mathcal{K}}^{(h)}, \mathbf{V} \cdot \mathbf{W}_{\mathcal{V}}^{(h)}\right)
         \end{aligned}$$
 
     - The multi-head attention output is a linear transformation of the concatenation of the per-head outputs
 
         $$\begin{aligned}
         \text{Multi-Head}\left(\mathbf{Q}, \mathbf{K}, \mathbf{V}\right)
-        &= \left[\cdots \oplus \text{HEAD}^{(h)} \oplus \cdots \right] \cdot \mathcal{W}_{\mathcal{O}}
+        &= \left[\cdots \oplus \text{HEAD}^{(h)} \oplus \cdots \right] \cdot \mathbf{W}_{\mathcal{O}}
+        \end{aligned}$$
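
The two formulas above (per-head attention, then concatenation followed by $\mathbf{W}_{\mathcal{O}}$) can be sketched in pure Python. The dot-product score, the hand-picked projection matrices, and the helper names are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    # Assumption: dot-product score function.
    scores = matmul(Q, [list(col) for col in zip(*K)])
    return matmul(softmax_rows(scores), V)

def multi_head(Q, K, V, heads, W_O):
    # Each head applies its own projections before attending.
    outputs = [
        attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs row by row, then project with W_O.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_O)

X = [[1.0, 0.0], [0.0, 1.0]]
pick_first = [[1.0], [0.0]]     # head 1 looks only at feature 0
pick_second = [[0.0], [1.0]]    # head 2 looks only at feature 1
heads = [(pick_first,) * 3, (pick_second,) * 3]
W_O = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection, for readability

out = multi_head(X, X, X, heads, W_O)
```

Each head here sees a different one-dimensional slice of the input, which is a deliberately simple stand-in for heads that learn different views of the same data.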
+
+- **Self-Attention** : the case where the input and the returned values are the same; it is usually combined with adaptive weighted attention so that each role ($\mathcal{Q}, \mathcal{K}, \mathcal{V}$) processes the information differently
+
+    ![05](/_post_refer_img/TextAnalytics/09-05.png){: width="100%"}
+
+    $$\begin{aligned}
+    \mathcal{Q}=\mathbf{X} \cdot \mathbf{W}^{(Q)}, \quad \mathcal{K}=\mathbf{X} \cdot \mathbf{W}^{(K)}, \quad \mathcal{V}=\mathbf{X} \cdot \mathbf{W}^{(V)}
+    \end{aligned}$$
+
+- **Mask Matrix** : (especially in self-attention over sequential data) blocks or adjusts the weights at specific positions so that information about later positions does not leak
+
+    $$\begin{aligned}
+    f\left(\mathcal{Q}_{M \times D}, \mathcal{K}_{N \times D}\right) + \mathbf{M}_{M \times N}
+    \end{aligned}$$
+
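+    - Self-attention lets every position of the input influence every other position, but a sequence generation model **must predict the next position using only the data up to the current position,** so the attention scores for future positions are set to $-\infty$, forcing the model not to see that information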
+
+        $$\begin{aligned}
+        \mathbf{M}_{M \times N}
+        = \begin{pmatrix}
+        0 & -\infty & -\infty & \cdots & -\infty \\
+        0 & 0 & -\infty & \cdots & -\infty \\
+        \vdots & \vdots & \vdots & \ddots & \vdots \\
+        0 & 0 & 0 & \cdots & 0
+        \end{pmatrix}
         \end{aligned}$$
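
A minimal sketch of how the mask acts on the scores, assuming a square (self-attention) score matrix; `causal_mask` is an illustrative name. Adding $-\infty$ before the softmax drives the corresponding weights to exactly zero.

```python
import math

def causal_mask(n):
    """0 on and below the diagonal, -inf above it (future positions)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)                            # finite: the diagonal is never masked
        exps = [math.exp(x - m) for x in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
]
mask = causal_mask(3)
masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]
weights = softmax_rows(masked)
```

The first row of `weights` attends only to position 1, the second row only to positions 1 and 2, and the last row to all three.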
 
 ## Application to SEQ2SEQ
@@ -113,25 +146,26 @@ image:
 - Attention Mechanism
 
     $$
-    \mathbf{W}^{(t)} = \text{Softmax}\left[\overrightarrow{\mathbf{q}}_{t} \cdot \mathbf{K}\right] \cdot \mathbf{V}
+    \mathcal{O}^{(t)}
+    = \text{Softmax}\left[\eta_{t} \cdot \mathbf{H}\right] \cdot \mathbf{H}
     $$
 
-    - $$\overrightarrow{\mathbf{q}}_{t} = \eta_{t}$$ : the decoder hidden state at time step $t$
-    - $$\mathbf{K} = \mathbf{V} = \mathbf{H}$$ : the matrix of encoder hidden states, one per position
+    - $$\overrightarrow{\mathcal{q}}_{t} = \eta_{t}$$ : the decoder hidden state at time step $t$
+    - $$\mathcal{K} = \mathcal{V} = \mathbf{H}$$ : the matrix of encoder hidden states, one per position
 
 - Deriving the context vector specific to decoder time step $t$
 
     $$\begin{aligned}
-    \overrightarrow{\mathbf{z}}_{t}
-    = \sum_{i}{\mathbf{W}^{(t)}_{i}}
-    = \overrightarrow{\mathbf{w}}^{(t)}_{1} + \overrightarrow{\mathbf{w}}^{(t)}_{2} + \cdots + \overrightarrow{\mathbf{w}}^{(t)}_{T}
+    \overrightarrow{\mathbf{c}}_{t}
+    = \sum_{i}{\mathcal{O}^{(t)}_{i}}
+    = \overrightarrow{\mathcal{o}}^{(t)}_{1} + \overrightarrow{\mathcal{o}}^{(t)}_{2} + \cdots + \overrightarrow{\mathcal{o}}^{(t)}_{T}
     \end{aligned}$$
 
 - Combining the context vector at time step $t$ with the hidden state at time step $t$
 
     $$
-    \overrightarrow{\mathbf{s}}_{t}
-    = \text{F}_{\text{tanh}}\left[\overrightarrow{\mathbf{z}}_{t} \oplus \eta_{t}\right]
+    \overrightarrow{\mathbf{z}}_{t}
+    = \text{F}_{\text{tanh}}\left[\overrightarrow{\mathbf{c}}_{t} \oplus \eta_{t}\right]
     $$
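
The three steps of this subsection can be sketched for a single decoder time step. This is a reading of the formulas under stated assumptions: each $\overrightarrow{\mathcal{o}}^{(t)}_{i}$ is taken to be the $i$-th encoder state scaled by its attention weight, $\text{F}_{\text{tanh}}$ is taken to be one linear layer followed by tanh, and the weight matrix `W_c` is a hypothetical parameter fixed by hand.

```python
import math

def luong_step(eta_t, H, W_c):
    # Scores: dot product between the decoder state and each encoder state.
    scores = [sum(a * b for a, b in zip(eta_t, h)) for h in H]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: sum over positions of the weighted encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, H)) for d in range(len(H[0]))]
    # Concatenate the context vector with the decoder state, then linear layer + tanh.
    concat = context + eta_t
    z_t = [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in W_c]
    return weights, context, z_t

H = [[1.0, 0.0], [0.0, 1.0]]        # encoder hidden states, T = 2
eta_t = [1.0, 0.0]                  # decoder hidden state at step t
W_c = [[1.0, 0.0, 0.0, 0.0],        # hypothetical weights that simply
       [0.0, 1.0, 0.0, 0.0]]        # pass the context through to tanh

weights, context, z_t = luong_step(eta_t, H, W_c)
```

Because the decoder state matches the first encoder state, the context vector is pulled toward that state before being combined with `eta_t`.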
 
 ### Bahdanau Attention
@@ -141,25 +175,26 @@ image:
 - Attention Mechanism
 
     $$
-    \mathbf{W}^{(t)} = \text{Softmax}\left[\overrightarrow{\mathbf{q}}_{t} \cdot \mathbf{K}\right] \cdot \mathbf{V}
+    \mathcal{O}^{(t)}
+    = \text{Softmax}\left[\eta_{t-1} \cdot \mathbf{H}\right] \cdot \mathbf{H}
     $$
 
-    - $$\overrightarrow{\mathbf{q}}_{t} = \eta_{t-1}$$ : the decoder hidden state at time step $t-1$
-    - $$\mathbf{K} = \mathbf{V} = \mathbf{H}$$ : the matrix of encoder hidden states, one per position
+    - $$\overrightarrow{\mathcal{q}}_{t} = \eta_{t-1}$$ : the decoder hidden state at time step $t-1$
+    - $$\mathcal{K} = \mathcal{V} = \mathbf{H}$$ : the matrix of encoder hidden states, one per position
 
 - Deriving the context vector specific to decoder time step $t$
 
     $$\begin{aligned}
-    \overrightarrow{\mathbf{z}}_{t}
-    = \sum_{i}{\mathbf{W}^{(t)}_{i}}
-    = \overrightarrow{\mathbf{w}}^{(t)}_{1} + \overrightarrow{\mathbf{w}}^{(t)}_{2} + \cdots + \overrightarrow{\mathbf{w}}^{(t)}_{T}
+    \overrightarrow{\mathbf{c}}_{t}
+    = \sum_{i}{\mathcal{O}^{(t)}_{i}}
+    = \overrightarrow{\mathcal{o}}^{(t)}_{1} + \overrightarrow{\mathcal{o}}^{(t)}_{2} + \cdots + \overrightarrow{\mathcal{o}}^{(t)}_{T}
     \end{aligned}$$
 
 - Combining the context vector at time step $t$ with the input vector at time step $t$
 
     $$
-    \overrightarrow{\mathbf{s}}_{t}
-    = \overrightarrow{\mathbf{z}}_{t} \oplus \hat{\mathbf{y}}_{t-1}
+    \overrightarrow{\mathbf{z}}_{t}
+    = \overrightarrow{\mathbf{c}}_{t} \oplus \hat{\mathbf{y}}_{t-1}
     $$
 
 -----