Lemmatize using quanteda

Question

How can I use quanteda to lemmatize a word such as makes so that it becomes make?

In Python, I could use the NLTK WordNet Lemmatizer.

Answer 1

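You can use tokens_wordstem or dfm_wordstem for stemming, but lemmatization has to be done with tokens_replace. Note the difference between the two: with lemmatization, "am" is changed to "be", because that is its lemma, whereas stemming leaves "am" untouched and cuts "lemmatize" down to "lemmat".

The lexicon package has a table called hash_lemmas that you can use as a dictionary. There is no default lemmatization function in quanteda.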

txt <- c("I am going to lemmatize makes into make, but not maker")

library(quanteda)

# stemming
tokens_wordstem(tokens(txt))
Tokens consisting of 1 document.
text1 :
 [1] "I"      "am"     "go"     "to"     "lemmat" "make"   "into"   "make"   ","      "but"    "not"    "maker" 

# lemmatizing using lemma table
tokens_replace(tokens(txt), pattern = lexicon::hash_lemmas$token, replacement = lexicon::hash_lemmas$lemma)
Tokens consisting of 1 document.
text1 :
 [1] "I"         "be"        "go"        "to"        "lemmatize" "make"      "into"      "make"      ","         "but"       "not"      
[12] "maker"
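
If you only need a few mappings, or want to see exactly what tokens_replace is doing, a small hand-made table can stand in for hash_lemmas. The following is a minimal sketch, not part of the original answer: lemma_table and its three entries are made up for illustration, and valuetype = "fixed" is my choice so that each entry is matched literally rather than as a glob pattern.

```r
library(quanteda)

txt <- c("I am going to lemmatize makes into make, but not maker")

# Hypothetical hand-made lemma table (illustration only); extend as needed.
lemma_table <- data.frame(
  token = c("am", "going", "makes"),
  lemma = c("be", "go", "make"),
  stringsAsFactors = FALSE
)

toks <- tokens(txt)

# Replace each listed token with its lemma; all other tokens are left as they are.
toks_lemma <- tokens_replace(
  toks,
  pattern = lemma_table$token,
  replacement = lemma_table$lemma,
  valuetype = "fixed"
)

# The same table can be applied to a document-feature matrix with dfm_replace.
dfm_lemma <- dfm_replace(
  dfm(toks),
  pattern = lemma_table$token,
  replacement = lemma_table$lemma
)

print(toks_lemma)
print(featnames(dfm_lemma))
```

Because the lookup is a plain token-to-lemma mapping, it cannot take context or part of speech into account; "maker" stays as it is simply because it is not in the table.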

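Another option for lemmatization is to use spacyr in combination with quanteda. See the spacyr tutorial.

Alternatively, you can get the lemmas with udpipe first and then use quanteda's tokens_replace or dfm_replace function.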
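
A rough sketch of the spacyr route, under the assumption that spaCy and the en_core_web_sm model are already installed and reachable from R (for example via spacy_install()). The model name is my assumption and not something from the answer above; check the spacyr tutorial for the setup on your machine.

```r
library(spacyr)
library(quanteda)

# Assumption: a working spaCy installation with the en_core_web_sm model.
spacy_initialize(model = "en_core_web_sm")

txt <- c("I am going to lemmatize makes into make, but not maker")

# Parse the text; the result is a data frame with one row per token and a lemma column.
parsed <- spacy_parse(txt, lemma = TRUE, pos = FALSE, entity = FALSE)

# Convert the parse to a quanteda tokens object, using lemmas in place of the surface forms.
toks_lemma <- as.tokens(parsed, use_lemma = TRUE)
print(toks_lemma)

# Shut down the background Python process when finished.
spacy_finalize()
```

Unlike a lookup table, spaCy assigns lemmas in context, so the exact lemmas depend on the model version.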
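
The udpipe route could look roughly like the sketch below. It assumes network access to download the English model on first use, and the step that reduces the annotation to one lemma per distinct token is my own simplification, needed because tokens_replace maps token types rather than individual occurrences.

```r
library(udpipe)
library(quanteda)

txt <- c("I am going to lemmatize makes into make, but not maker")

# Downloads an English model into the working directory (needs network access).
model_info <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = model_info$file_model)

# Annotate the text; the data frame has one row per token, including a lemma column.
anno <- as.data.frame(udpipe_annotate(ud_model, x = txt))

# Keep the first occurrence of each distinct token, dropping rows without a lemma.
lemmas <- anno[!is.na(anno$lemma) & !duplicated(anno$token), c("token", "lemma")]

# Apply the token-to-lemma mapping to the quanteda tokens.
toks_lemma <- tokens_replace(
  tokens(txt),
  pattern = lemmas$token,
  replacement = lemmas$lemma,
  valuetype = "fixed"
)
print(toks_lemma)
```

Note that udpipe and quanteda tokenize independently, so on messier text some tokens may not line up and will simply be left unchanged.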
