Should <EOS> and <BOS> tags be explicitly added to the vocabulary after using the keras.preprocessing.text Tokenizer?

Question

In Keras, keras.preprocessing.text lets us tokenize text according to our requirements and generate a vocabulary.

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(split=' ', oov_token=1)
tokenizer.fit_on_texts(["Hello world"])
seqs = tokenizer.texts_to_sequences(["Hello world"])

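What I am unsure about is this: when we feed the resulting seqs to a neural network such as an RNN, do we need to explicitly add end-of-sequence (EOS) and beginning-of-sequence (BOS) tags after padding the sequences to a fixed length? Or does Keras do this for us? (I have not seen any example that explicitly adds EOS and BOS when using the Keras tokenizer.)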

Answer 1

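No, you do not need to add <EOS> or <BOS> for tf.keras.preprocessing.text.Tokenizer. The index_word mapping is built in a fixed order: the oov_token comes first, followed by the remaining words sorted by descending frequency, with words of equal frequency kept in the order they appear in the input. This lets the Keras API handle the mapping internally, unlike other text preprocessing APIs that rely on <START> and <END> tags.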

The example below shows the index_word mapping.

import tensorflow as tf

text_data = ["this is the sample sentence",
             "one more sentence"]

# Out-of-vocabulary words are mapped to the "<UNK>" token
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<UNK>")
lang_tokenizer.fit_on_texts(text_data)
print(lang_tokenizer.index_word)

index_word:

{1: '<UNK>',
 2: 'sentence',
 3: 'this',
 4: 'is',
 5: 'the',
 6: 'sample',
 7: 'one',
 8: 'more'}

Test:

# "testing" and "with" were not seen during fitting, so they map to <UNK>
res = lang_tokenizer.texts_to_sequences(["testing with sample sentence"])
print(res)

[[1, 1, 6, 2]]
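
To make the ordering rule easier to see, here is a minimal plain-Python sketch that reproduces the mapping and the test result above without TensorFlow. It is not the Keras implementation: build_index_word and texts_to_sequences are hypothetical helpers written for this illustration, and they assume simple lowercasing and whitespace splitting with no punctuation filtering.

```python
from collections import Counter


def build_index_word(texts, oov_token="<UNK>"):
    """Hypothetical helper: OOV token first, then words by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ordered = sorted(counts, key=lambda w: counts[w], reverse=True)
    vocab = [oov_token] + ordered
    # Indices start at 1, matching the mapping shown above
    return {i: w for i, w in enumerate(vocab, start=1)}


def texts_to_sequences(texts, index_word, oov_token="<UNK>"):
    """Hypothetical helper: map words to indices, unknown words to the OOV index."""
    word_index = {w: i for i, w in index_word.items()}
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in text.lower().split()]
            for text in texts]


index_word = build_index_word(["this is the sample sentence",
                               "one more sentence"])
print(index_word)
print(texts_to_sequences(["testing with sample sentence"], index_word))
# → [[1, 1, 6, 2]]
```

Note that nothing in this mapping corresponds to a start or end marker: the only special entry is the OOV token at index 1.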

Hope this answers your question. Happy learning!

python

vocabulary

keras

tensorflow

recurrent-neural-network