Attention机制

回顾RNN结构

讲attention之前先回顾一下RNN的各种结构

N to N

如：语音处理，时间序列处理

N to 1

如：情感分析，输入一段视频判断类型

1 to N

或

如：从图像生成文字，从类别生成语音或音乐

N to M

这种就够又叫encoder-decoder模型，或Seq2Seq模型

或

如：机器翻译，文本摘要，阅读理解，语音识别……

回归正题Attention

在encoder-decoder结构中，显然当要处理的信息长度很长的时候，一个c存储不了那么多信息，导致处理精度下降
所以我们打算计算很多个c
- 每一个c会去选取和当前所要输出的y最合适的上下文信息。
  - 具体的，我们用$a_{ij}$来衡量encoder中第j阶段的$h_j$(hidden state)，和当前decoder中第i阶段的相关性
  - 以机器翻译为例
    - 上图标有红色的地方就是和decoder当前阶段最相关的地方，对应的值较大；其他的地方对应的值较小。这里就是attention的精髓所在了——每个decoder的状态对于每个encoder的状态分配注意力（当然，$\sum_{j=1}^{T_x}a_{ij}=1$）

  $$
  h_t=RNN_{enc}(x_t,h_{t-1})\\
  h'_t=RNN_{dec}{(c_t,h'_{t-1})}\\
  c_i=\sum_{j=1}^{T_x} a_{ij}h_j\qquad (T_x是x总长)
  $$

-

接下来就是求$a_{ij}$
- $$
  b_{ij}=score(h’{i-1},h_j)\qquad (我们的h’是从0开始的)\
  a{ij}=\frac{e^{b_{ij}}}{\sum e^{b_{ik}}}
  $$
那么我们的score是怎么计算的呢
- 最简单的方方法就是直接计算点乘，点积类似计算相似度。

而，attention极值就是来解决这个问题的，定义如下
- 给定一组向量集合value，以及一个向量集合query，attention机制就是根据query计算value的加权求和机制

主要也就一块代码

class AttDecoder_RNN(nn.Module):
    def __init__(self, word_num, hidden_dim, dropp=config.drop_p, max_length=config.max_length):
        super(AttDecoder_RNN, self).__init__()
        self.word_num = word_num
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(word_num, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim)
        self.dropout = nn.Dropout(dropp)

        self.attn = nn.Linear(2 * hidden_dim, max_length)
        self.attn_C = nn.Linear(2 * hidden_dim, self.hidden_dim)

        self.out = nn.Linear(self.hidden_dim, self.word_num)

    def forward(self, encoder_state, input, hidden=None):
        batch_size = input.size(0)
        if hidden is None:
            hidden = t.rand(1, batch_size, self.hidden_dim)

        emb = self.embed(input)
        emb = self.dropout(emb)

        att_w = F.softmax(self.attn(t.cat((emb, hidden[0]), 1)), dim=1)  #先用上一层hidden与input拼接，然后通过网络映射到max_length得到权重a，得到权重再softmax
        att_c = t.bmm(att_w.unsqueeze(0).permute(1, 0, 2), encoder_state.permute(1, 0, 2))  #用权重a与encoder的hidden相乘得到c,(permute是pytorch调整维度)

        output = t.cat((emb, att_c.permute(1, 0, 2)[0]), 1)    #再把input与c拼接
        output = self.attn_C(output).unsqueeze(0)   #将长度映射还原，拼接之后hidden长度会加倍

        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, att_w