Understanding ELMo's number of representations

I am trying out ELMo by simply using it as part of a larger PyTorch model. A basic example is given here.

This is a torch.nn.Module subclass that computes any number of ELMo representations and introduces trainable scalar weights for each. For example, this code snippet computes two layers of representations (as in the SNLI and SQuAD models from our paper):

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# Compute two different representations for each token.
# Each representation is a linear weighted combination of the
# 3 layers in ELMo (i.e., the char CNN and the outputs of the two BiLSTM layers).
elmo = Elmo(options_file, weight_file, 2, dropout=0)

# use batch_to_ids to convert sentences to character ids
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)

# embeddings['elmo_representations'] is a length-two list of tensors.
# Each element contains one layer of ELMo representations with shape
# (2, 3, 1024).
#   2    - the batch size
#   3    - the sequence length of the batch
#   1024 - the length of each ELMo vector

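My question concerns the 'representations'. Can you compare them to the output layer of an ordinary word2vec model? You can choose how many of them ELMo gives back (adding an n-th dimension), but what is the difference between these generated representations, and what are they typically used for?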

To give you an idea: for the code above, embeddings['elmo_representations'] returns a list of two items (the two representation layers), but they are identical.

In short, how are the 'representations' in ELMo defined?

See section 3.2 of the original paper.

ELMo is a task specific combination of the intermediate layer representations in the biLM. For each token, a L-layer biLM computes a set of 2L + 1 representations
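For reference, the set of representations and the task-specific combination from that section can be written as below. The notation follows the paper; for the two-layer biLM used in the question, L = 2, so there are 2L + 1 = 5 vectors per token, or L + 1 = 3 layers once the forward and backward states of each LSTM layer are concatenated.

```latex
\begin{aligned}
R_k &= \{\, \mathbf{x}_k^{LM},\ \overrightarrow{\mathbf{h}}_{k,j}^{LM},\ \overleftarrow{\mathbf{h}}_{k,j}^{LM} \mid j = 1, \dots, L \,\}
     = \{\, \mathbf{h}_{k,j}^{LM} \mid j = 0, \dots, L \,\} \\
\mathrm{ELMo}_k^{task} &= \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
\end{aligned}
```

Here the weights s_j^{task} are softmax-normalized and gamma^{task} is a learned scalar, which matches the "trainable scalar weights" mentioned in the documentation quoted in the question.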

Section 3.1, just before it, says:

Recent state-of-the-art neural language models compute a context-independent token representation (via token embeddings or a CNN over characters) then pass it through L layers of forward LSTMs. At each position k, each LSTM layer outputs a context-dependent representation. The top layer LSTM output is used to predict the next token with a Softmax layer.

To answer your question, the representations are these L LSTM-based context-dependent representations.
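As a rough illustration of why the two items in embeddings['elmo_representations'] can come out identical, here is a minimal pure-Python sketch of the weighted combination from section 3.2. It is not AllenNLP's code: the helper names scalar_mix and softmax and the toy layer values are made up for the example. It assumes that each output representation has its own scalar weights, that those weights start out equal (as far as I know, AllenNLP initialises them to zero), and that dropout is 0 as in the question.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, scalars, gamma=1.0):
    """Weighted sum of per-layer vectors for one token (hypothetical helper).

    layers  -- list of L + 1 vectors (lists of floats), one per biLM layer
    scalars -- list of L + 1 unnormalised weights, one per layer
    gamma   -- overall scaling factor
    """
    weights = softmax(scalars)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy biLM output for a single token: 3 layers (char CNN + 2 biLSTM layers),
# 4 dimensions each instead of 1024. The numbers are arbitrary.
layers = [
    [1.0, 0.0, 2.0, -1.0],
    [0.5, 1.5, -0.5, 0.0],
    [-1.0, 2.0, 1.0, 3.0],
]

# Two output representations, each with its own scalars, both starting at zero
# (assumed initialisation): the softmax is uniform, so both are the plain average.
rep_a = scalar_mix(layers, [0.0, 0.0, 0.0])
rep_b = scalar_mix(layers, [0.0, 0.0, 0.0])
identical_at_init = rep_a == rep_b

# Once training has moved the scalars apart, the two representations differ.
rep_a_trained = scalar_mix(layers, [2.0, 0.0, -1.0])
rep_b_trained = scalar_mix(layers, [-1.0, 0.0, 2.0], gamma=0.5)
differ_after_training = rep_a_trained != rep_b_trained
```

Under these assumptions, the representations only become different once the scalar weights are trained as part of the larger model, for example when one copy is fed to the input of a downstream network and the other to its output, as the documentation says was done for the SNLI and SQuAD models.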