How to use Keras to build a Part-of-Speech tagger?

Question

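I am trying to implement a Part-of-Speech tagger using a neural network with the help of Keras.

I am using a Sequential model and training on data from NLTK's Penn Treebank corpus (i.e. from nltk.corpus import treebank). As I understand it, building a neural network with Keras involves the following steps:

Load the data
Define -> compile -> fit the model
Evaluate the model

Specifically, I am not sure how to preprocess the tagged training data so that I can use it in my model. The tagged data comes from NLTK's corpus as key-value pairs, where the key is the English word and the value is the corresponding POS tag.

To be precise, I don't know how to arrange the data in the "data" and "labels" variables in the following code: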

model.fit(data, labels, nb_epoch=50, batch_size=32)

Can someone give me some hints? Thank you very much for your time; I really appreciate your help!

Answer 1

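There are many variations on how to do this, and they depend on the amount of data you have and the time you want to invest in it. I'll try to give you the mainstream path, which you can improve on yourself, while citing some of the alternatives. I will not assume prior knowledge of text modelling with deep learning.

One way is to model the problem as multi-class classification, where the classes (label types) are all the possible POS tags. There are two common ways to frame this with a deep learning model: one is a window model; the other is a sequence tagger using a recurrent unit.

Let's assume the simpler of the two, the window model. Then you can do the following: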

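Structuring the data

Chop your corpus into windows of W words (e.g. 3 words), where the center word is the one you want to classify and the other words are context. Let's call this part of the data X.
For each window, get the POS tag of the center word. Let's call this part of the data y.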
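The two steps above can be sketched in plain Python. This is only an illustration under some assumptions: the toy tagged_sents list stands in for the list of (word, tag) sentences that nltk.corpus.treebank.tagged_sents() returns, and the "<PAD>" token and the build_windows helper are hypothetical names introduced here so that words at the edges of a sentence still get a full window.

```python
# Toy stand-in for nltk.corpus.treebank.tagged_sents(): a list of sentences,
# each a list of (word, tag) pairs.
tagged_sents = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]

PAD = "<PAD>"  # hypothetical padding token for positions outside the sentence


def build_windows(sents, window_size=3):
    """Return (X, y): X holds word windows, y the tag of each window's center word."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that there is a center word")
    half = window_size // 2
    X, y = [], []
    for sent in sents:
        # Pad both ends so the first and last words also sit in the center.
        words = [PAD] * half + [word for word, _ in sent] + [PAD] * half
        for i, (_, tag) in enumerate(sent):
            X.append(words[i:i + window_size])
            y.append(tag)
    return X, y


X, y = build_windows(tagged_sents, window_size=3)
```

Each entry of X is one window of words and the entry of y at the same position is the tag of its center word, so the two lists always have the same length.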

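Encoding the data

Encoding X as vectors

Neural nets need X encoded as a sequence of vectors. A common choice is to encode each word as a word embedding.

To do so, first tokenize your text and encode each word as an integer word id (e.g. every occurrence of "cat" will be the number 7). If you don't have your own tokenizer, you can use the one bundled with Keras. It takes text and returns a sequence of integers/word ids.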
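If the words are already tokenized (as they are in a tagged corpus), the word-to-id mapping can also be built by hand. The following is a rough sketch rather than the Keras tokenizer itself; build_vocab, encode_windows, and the reserved "<PAD>" and "<UNK>" entries are hypothetical names chosen for this example.

```python
PAD, UNK = "<PAD>", "<UNK>"  # hypothetical reserved tokens


def build_vocab(windows):
    """Map every distinct word to an integer id; 0 is padding, 1 is unknown."""
    vocab = {PAD: 0, UNK: 1}
    for window in windows:
        for word in window:
            # Assign the next free id the first time a word is seen.
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_windows(windows, vocab):
    """Replace each word with its id, falling back to the UNK id."""
    return [[vocab.get(word, vocab[UNK]) for word in window] for window in windows]


windows = [
    ["<PAD>", "the", "cat"],
    ["the", "cat", "sat"],
    ["cat", "sat", "<PAD>"],
]
vocab = build_vocab(windows)
X_ids = encode_windows(windows, vocab)

# A word never seen while building the vocabulary maps to the UNK id.
unseen = encode_windows([["the", "dog", "sat"]], vocab)
```

Reserving an id for unknown words matters at test time, when the tagger will meet words that were not in the training vocabulary.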

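Second, you may want to pad and truncate each sequence of word ids so that every instance has the same length (note: there are other ways of handling this). An example from imdb_lstm.py is: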

(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Then you can use an Embedding layer to convert the sequence of padded/truncated word ids into a sequence of word embeddings. An example from imdb_lstm.py:

model = Sequential()
model.add(Embedding(max_features, 128, dropout=0.2))
model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))  # try using a GRU instead, for fun

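Here the output of the embedding is fed to an LSTM. I list other model options at the end.

Encoding y

To do multi-class classification with Keras, one usually uses categorical_crossentropy, which expects the label to be a one-hot vector that is as long as the number of possible categories (the number of possible POS tags in your case). You can use Keras's to_categorical. Note that it expects a vector of integers where each integer represents a class (e.g. NNP could be 0, VBD could be 1, and so on):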

def to_categorical(y, nb_classes=None):
    '''Convert class vector (integers from 0 to nb_classes) to binary class matrix, for use with categorical_crossentropy.
    # Arguments
        y: class vector to be converted into a matrix
        nb_classes: total number of classes
    # Returns
        A binary matrix representation of the input.
    '''
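To make the label encoding concrete, here is a small pure-Python sketch of the two steps: mapping each tag string to an integer class, then expanding it to a one-hot row. It only mimics what to_categorical produces and is not the Keras implementation; build_tag_index and one_hot are hypothetical helper names, and sorting the tags is just one arbitrary way to get a stable numbering.

```python
def build_tag_index(tags):
    """Assign each distinct tag an integer class id (sorted for reproducibility)."""
    return {tag: i for i, tag in enumerate(sorted(set(tags)))}


def one_hot(tags, tag_index):
    """Turn a list of tags into rows with a single 1 at the tag's class id."""
    n_classes = len(tag_index)
    rows = []
    for tag in tags:
        row = [0] * n_classes
        row[tag_index[tag]] = 1
        rows.append(row)
    return rows


y = ["DT", "NN", "VBD", "NN"]
tag_index = build_tag_index(y)
y_ids = [tag_index[tag] for tag in y]  # the integer vector to_categorical expects
Y = one_hot(y, tag_index)
```

Keep the tag_index mapping around after training: it is needed to turn the model's predicted class ids back into tag strings.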

Model options

Since in this line of solution you would essentially be doing multi-class classification, you can follow any of the imdb_ examples from the Keras examples. These are actually binary text classification examples. To make them multi-class, you need to use a softmax instead of a sigmoid as the final activation function and categorical_crossentropy instead of binary_crossentropy, as in the mnist_ examples:

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
