Does keras-tokenizer perform the task of lemmatization and stemming?

Question

Does the Keras tokenizer provide functions such as stemming and lemmatization? If it does, how is it done? I need an intuitive understanding. Also, what does text_to_sequence do?

Answer 1

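There may be some confusion between what a tokenizer does and what tokenization is. Tokenization splits a string into smaller entities such as words or single characters. These entities are therefore also called tokens. Wikipedia provides a nice example:

The quick brown fox jumps over the lazy dog becomes: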

<sentence>
  <word>The</word>
  <word>quick</word>
  ...
  <word>dog</word>
</sentence>
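
As a rough illustration of the same idea, here is a minimal sketch in plain Python (not Keras code) that splits the sentence into word tokens and into character tokens. The variable names are made up for this example.

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Word-level tokens: split the string on whitespace.
word_tokens = sentence.split()

# Character-level tokens: every character (spaces included) is one token.
char_tokens = list(sentence)

print(word_tokens)
print(char_tokens[:9])
```

Which granularity you pick depends on the model you want to feed; the word-level split is the one that matches the Wikipedia example above.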

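Lemmatization (grouping together the inflected forms of a word) or stemming (the process of reducing inflected, or sometimes derived, words to their word stem) is something you do during preprocessing. Tokenization can be part of the preprocessing pipeline before or after (or both) lemmatization and stemming.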
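
To make the difference between the two more tangible, here is a deliberately tiny sketch in plain Python. It is not a real stemmer or lemmatizer: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical and exist only for illustration. Real tools use far larger rule sets and vocabularies.

```python
# Toy stemmer: chops a few common suffixes off a word.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip the suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: looks the word up in a small hand-written dictionary.
LEMMAS = {"jumps": "jump", "mice": "mouse", "was": "be"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

tokens = ["jumps", "jumped", "jumping", "mice", "was"]
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print(stems)   # ['jump', 'jump', 'jump', 'mice', 'was']
print(lemmas)  # ['jump', 'jumped', 'jumping', 'mouse', 'be']
```

The stemmer works purely on the surface form, so it cannot handle irregular words like "mice". The lemmatizer relies on vocabulary knowledge, so it only handles the words it knows about.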

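In any case, Keras is not a framework for fully fledged text preprocessing. You therefore feed data that is already cleaned, lemmatized, etc. into Keras. Regarding your first question: no, Keras does not provide functionality such as lemmatization or stemming.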

What Keras understands by text preprocessing, as described in the docs, is the functionality to prepare data so that it can be fed to a Keras model (like a Sequential model). This is, for example, why the Keras Tokenizer does this:

This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf...

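By vectorizing your input strings and transforming them into numeric data, you can, for example, feed them as input to a neural network (in this case, a Keras model).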

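What text_to_sequence means (the Tokenizer method itself is named texts_to_sequences) can be extracted from this: [...] sequence of integers (each integer being the index of a token in a dictionary) [...]. This means that your former strings can afterwards be a sequence (e.g. an array) of integers instead of actual words.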
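
The following is a minimal sketch of that idea in plain Python, not the Keras implementation. The helpers `build_word_index` and `to_sequences` are hypothetical names; the assumption that indices start at 1 and that more frequent words get lower indices is modeled on how the Keras Tokenizer is commonly described, so check the docs before relying on it.

```python
from collections import Counter

texts = ["the quick brown fox", "the lazy dog", "the quick dog"]

def build_word_index(texts):
    # Count every lowercased, whitespace-separated word in the corpus.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent word gets index 1; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Replace each known word by its integer index; unknown words are skipped.
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)

print(word_index)  # {'the': 1, 'quick': 2, 'dog': 3, 'brown': 4, 'fox': 5, 'lazy': 6}
print(sequences)   # [[1, 2, 4, 5], [1, 6, 3], [1, 2, 3]]
```

Note that no stemming or lemmatization happens here: "dog" and "dogs" would simply receive two different indices.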

In this context you should also take a look at what Keras Sequential models are, since they take sequences as input.

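Additionally, text_to_word_sequence() (docs) also provides this kind of tokenization, but it does not vectorize your data into numeric vectors; it returns an array of your tokenized strings.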

Converts a text to a sequence of words (or tokens).
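
A simplified sketch of what such a function does, written in plain Python rather than with Keras: lowercase the text, replace punctuation with spaces, and split on whitespace. The function name `simple_word_sequence` is hypothetical, and the exact set of filtered characters in Keras may differ from `string.punctuation` used here.

```python
import string

def simple_word_sequence(text):
    # Map every punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # Lowercase, strip punctuation, then split on whitespace.
    return text.lower().translate(table).split()

words = simple_word_sequence("The quick brown fox, jumps over the lazy dog!")
print(words)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```

The result is still a list of strings, so "jumps" stays "jumps"; nothing is reduced to a stem or lemma.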

