StanfordNLP lemmatization cannot handle -ing words

Question

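I have been experimenting with the Stanford NLP toolkit and its lemmatization capabilities, and I am surprised by how it lemmatizes some words. For example: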

depressing -> depressing
depressed -> depressed
depresses -> depress

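It fails to map depressing and depressed to the same lemma. The same happens with confusing and confused, and with hopelessly and hopeless. I get the impression that the only thing it can do is strip a trailing s (e.g. feels -> feel). Is this behavior normal for English lemmatizers? I would expect them to map such variants of common words to the same lemma.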

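If this is normal, should I use a stemmer instead? And is there a way to use a stemmer such as Porter (Snowball, etc.) in StanfordNLP? Their documentation does not mention stemmers; however, the API has a CoreAnnotations.StemAnnotation. If it is not possible with StanfordNLP, which stemmers would you recommend in Java?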

Answer 1

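Lemmatization depends crucially on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.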

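In the sentence "This is confusing", confusing is analyzed as an adjective, so it is lemmatized to confusing. In the sentence "I was confusing you with someone else", by contrast, confusing is analyzed as a verb and is lemmatized to confuse.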
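To make the point concrete, here is a toy sketch of a lemma lookup keyed on both the word and its part of speech. The table, the tag names (ADJ, VERB) and the lemmatize function are all hypothetical and cover only the words from this thread; a real lemmatizer relies on a tagger plus a full lexicon and rules.

```python
# Toy illustration only: the entries below are hypothetical and cover
# just the example words discussed in this thread.
LEMMA_TABLE = {
    ("confusing", "ADJ"): "confusing",
    ("confusing", "VERB"): "confuse",
    ("depressing", "ADJ"): "depressing",
    ("depressing", "VERB"): "depress",
    ("depressed", "ADJ"): "depressed",
    ("depressed", "VERB"): "depress",
    ("depresses", "VERB"): "depress",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


# Same surface form, different part of speech, different lemma.
print(lemmatize("confusing", "ADJ"))   # as in "This is confusing"
print(lemmatize("confusing", "VERB"))  # as in "I was confusing you ..."
```

The same word form yields two different lemmas purely because the tag differs, which is why depressing and depressed stay unchanged when they are tagged as adjectives.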

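If you want tokens with different parts of speech to map to the same term, you can use a stemming algorithm such as Porter stemming, which you can simply call on each token.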
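As a rough illustration of what "call a stemmer on each token" looks like, here is a deliberately crude suffix stripper. It is not the Porter algorithm: the suffix list, the min_stem parameter and the crude_stem name are assumptions made up for this sketch, and a real Porter or Snowball implementation should be used in practice.

```python
# Crude suffix stripping, NOT the Porter algorithm. It only shows how a
# stemmer ignores part of speech and maps related forms to one stem.
SUFFIXES = ("ingly", "edly", "ing", "ed", "es", "ly", "s")  # longest first


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Do not turn "hopeless" into "hopeles".
        if suffix == "s" and word.endswith("ss"):
            continue
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


tokens = ["depressing", "depressed", "depresses", "hopelessly", "hopeless"]
for token in tokens:
    print(token, "->", crude_stem(token))
```

Unlike the lemmatizer, this maps depressing, depressed and depresses to the same stem regardless of how each token was tagged. The price is that stems such as confus (from confusing) are not dictionary words.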

Answer 2

Adding to yvespeirsman's answer:

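I found that when applying lemmatization, we should make sure the text still has its punctuation. In other words, punctuation removal should come after lemmatization, because the lemmatizer takes the type of each word (its part of speech) into account when doing its job, and the tagger relies on punctuation to find sentence boundaries.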

Note the lemmas confuse and confusing in the examples below.

With punctuation:

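import spacy
nlp = spacy.load("en_core_web_sm")  # assumed model name; any English spaCy pipeline works
for token in nlp("This is confusing. You are confusing me."):
    print(token.lemma_)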

Output:

this
be
confusing
.
-PRON-
be
confuse
-PRON-
.

Without punctuation:

for token in nlp("This is confusing You are confusing me"):
    print(token.lemma_)

Output:

this
be
confuse
-PRON-
be
confuse
-PRON-
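One way to act on this is to lemmatize the punctuated text first and drop punctuation tokens only afterwards. The sketch below assumes token objects that expose lemma_ and is_punct attributes, as spaCy tokens do; the helper name lemmas_without_punct is made up here, and the stand-in tokens simply mirror the "with punctuation" output above rather than coming from a real spaCy run.

```python
from types import SimpleNamespace


def lemmas_without_punct(tokens):
    """Return lemmas of already-analyzed tokens, skipping punctuation.

    The tokens must come from a pipeline that saw the punctuated text,
    so punctuation is removed only at this final step.
    """
    return [tok.lemma_ for tok in tokens if not tok.is_punct]


# Stand-in tokens copied from the "with punctuation" output above.
analyzed = [
    ("this", False), ("be", False), ("confusing", False), (".", True),
    ("-PRON-", False), ("be", False), ("confuse", False),
    ("-PRON-", False), (".", True),
]
fake_doc = [SimpleNamespace(lemma_=lemma, is_punct=punct) for lemma, punct in analyzed]

print(lemmas_without_punct(fake_doc))
```

With a real pipeline, the same helper could be called as lemmas_without_punct(nlp("This is confusing. You are confusing me.")), which keeps the adjective reading of the first confusing while still yielding punctuation-free output.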

Tags: java, nlp, stemming, lemmatization, stanford-nlp