Attention Model

In the sequence-to-sequence model, we first encode the entire input sequence and then decode it. For machine translation, this is like memorizing the whole sentence before translating it. It is more natural to read part of the sentence, translate it, read the next part, translate that, and so on.

The attention model was invented from this intuition.

Attention model

We put a ‘decoding’ RNN on top of the ‘encoding’ RNN. The activation $s^{<t>}$ in the decoding RNN takes $x^{<t>}$ and $c^{<t>}$ as input.

This context $c^{<t>}$ is computed with the attention weights $\alpha^{<t,t'>}$. The attention weight $\alpha^{<t,t'>}$ is the amount of attention $\hat{y}^{<t>}$ should pay to $a^{<t'>}$.
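Written out, this relationship is usually a weighted sum over the encoder activations. The following is a sketch, assuming $T_x$ (a symbol not used elsewhere in this post) denotes the length of the input sequence and $a^{<t'>}$ is the encoder activation at input position $t'$:

$$c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>} \, a^{<t'>}, \qquad \alpha^{<t,t'>} \ge 0, \qquad \sum_{t'=1}^{T_x} \alpha^{<t,t'>} = 1$$

For example, with three input words and weights $(0.7, 0.2, 0.1)$ for output step $t$, the context would be $0.7\,a^{<1>} + 0.2\,a^{<2>} + 0.1\,a^{<3>}$, so the first input word dominates what the decoder sees at that step.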

I’ll give the precise formulas for how the attention model works.

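1. Combine the forward and backward activations of the encoding BRNN at each input position $t'$ into $a^{<t'>}$

2. With the activation $a^{<t'>}$ from the encoding BRNN and the previous state $s^{<t-1>}$ of the decoding RNN, compute $e^{<t,t'>}$

$W_e$ and $b_e$ are trainable parameters, so they are optimized by gradient descent.

3. Apply softmax to $e$ over the input positions $t'$ to compute $\alpha$, so that the $\alpha$s for one translated word sum to 1.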

4. Compute $c^{<t>}$, the $\alpha$-weighted sum of the encoder activations, which is fed into the decoding RNN together with $s^{<t-1>}$.

5. Compute a normal RNN layer

where $\hat{y}_{class}^{<t-1>}$ is the one-hot representation of the class predicted by $\hat{y}^{<t-1>}$
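The five steps above can be sketched for a single decoding step $t$ in plain Python. This is only an illustration under several assumptions that are not in the post: the score $e$ is taken to be a single linear layer over the concatenation of $s^{<t-1>}$ and $a^{<t'>}$ (suggested by the mention of $W_e$ and $b_e$), the "normal RNN layer" is taken to be a tanh cell followed by a softmax output, and the names `attention_step`, `W_s`, `b_s`, `W_y`, `b_y` are hypothetical. Plain lists are used so the sketch runs without dependencies; a real model would use a tensor library.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [v / total for v in exps]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def attention_step(a_fwd, a_bwd, s_prev, y_prev_onehot, p):
    """One decoding step t. Returns (s, y_hat, alpha, c)."""
    # Step 1: a^{<t'>} = [forward activation, backward activation]
    a = [f + b for f, b in zip(a_fwd, a_bwd)]  # list concatenation per position

    # Step 2 (assumed form): e^{<t,t'>} = W_e . [s^{<t-1>}, a^{<t'>}] + b_e
    e = [dot(p["W_e"], s_prev + a_t) + p["b_e"] for a_t in a]

    # Step 3: softmax over input positions t', so the weights sum to 1
    alpha = softmax(e)

    # Step 4: c^{<t>} = sum over t' of alpha^{<t,t'>} * a^{<t'>}
    c = [sum(w * a_t[i] for w, a_t in zip(alpha, a)) for i in range(len(a[0]))]

    # Step 5 (assumed form): tanh RNN cell on [s^{<t-1>}, c^{<t>}, y_class^{<t-1>}]
    rnn_in = s_prev + c + y_prev_onehot
    s = [math.tanh(dot(row, rnn_in) + b) for row, b in zip(p["W_s"], p["b_s"])]
    y_hat = softmax([dot(row, s) + b for row, b in zip(p["W_y"], p["b_y"])])
    return s, y_hat, alpha, c


# Usage with small random parameters (illustrative sizes only).
random.seed(0)
T_x, n_a, n_s, n_y = 4, 3, 5, 6


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


params = {
    "W_e": rand_vec(n_s + 2 * n_a), "b_e": 0.0,
    "W_s": rand_mat(n_s, n_s + 2 * n_a + n_y), "b_s": [0.0] * n_s,
    "W_y": rand_mat(n_y, n_s), "b_y": [0.0] * n_y,
}
a_fwd = rand_mat(T_x, n_a)   # forward BRNN activations, one row per input word
a_bwd = rand_mat(T_x, n_a)   # backward BRNN activations
s_prev = [0.0] * n_s         # initial decoder state
y_prev = [0.0] * n_y         # no previous word at the first step

s, y_hat, alpha, c = attention_step(a_fwd, a_bwd, s_prev, y_prev, params)

# One-hot representation of the predicted class, fed back at the next step.
k = max(range(n_y), key=lambda i: y_hat[i])
y_class = [1.0 if i == k else 0.0 for i in range(n_y)]
```

To decode a whole sentence, this step would be repeated with `s` and `y_class` passed in as `s_prev` and `y_prev_onehot` for the next output word, while the encoder activations stay fixed.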
