R wordStem chopping words too much

Question

I'll illustrate with an example:

library(data.table)
dt <- data.table(words = c("finance", "financial", "business"),
                  freq = c(123, 5, 4589))
dt <- dt[, words := SnowballC::wordStem(words, language = "english")]
View(dt)

words    freq
financ    123
financi     5
busi     4589

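I thought stemming would give me finance, finance and business. At the very least I expected finance and financial to share the same base word. I'm trying to group similar words. It works for some words, e.g. have and having both become have, but for others like the ones above it doesn't seem to work, unless I've misunderstood something?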

Answer 1

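Your results look like exactly what the Porter stemming algorithm is supposed to do.

The documentation (Step 4) shows stemming examples for the suffixes that appear in your example: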

(m>1) AL -> revival -> reviv

(m>1) ANCE -> allowance -> allow
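To see these rules in action, here is a minimal sketch that runs the same stemmer on the documentation's examples and on the words from the question. It assumes the SnowballC package is installed. As far as I can tell, finance keeps its "anc" because the part before ANCE is just "fin", which is too short to satisfy the (m>1) condition, so only the final e is dropped.

```r
library(SnowballC)

# Examples from the Porter documentation: both suffixes are removed
# because the remaining stem is long enough (m > 1).
doc_stems <- wordStem(c("revival", "allowance"), language = "english")
print(doc_stems)   # "reviv" "allow"

# Words from the question: AL is stripped from "financial",
# but ANCE is not stripped from "finance" (only the trailing e goes).
q_stems <- wordStem(c("finance", "financial"), language = "english")
print(q_stems)     # "financ" "financi"
```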

If you want to group your words, you might want to trim them before running wordStem, or use a string-matching function (e.g. agrep) after stemming.
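As a rough sketch of the second suggestion, the stems can be grouped by edit distance after stemming. This uses base R's adist rather than agrep (both do approximate matching). The threshold of 1 and the column names stem and group are assumptions for illustration, and the threshold would likely need tuning on real data.

```r
library(data.table)

dt <- data.table(words = c("finance", "financial", "business"),
                 freq  = c(123, 5, 4589))
dt[, stem := SnowballC::wordStem(words, language = "english")]

# Pairwise edit distances between stems ("financ" vs "financi" is 1).
d <- adist(dt$stem)

# Label each row with the first stem within distance 1 of its own stem
# (assumed threshold; a larger value merges more aggressively).
first_close <- apply(d <= 1, 1, function(r) which(r)[1])
dt[, group := stem[first_close]]

# Sum the frequencies per group.
res <- dt[, .(freq = sum(freq)), by = group]
print(res)
```

With the sample data, finance and financial end up in one group labelled financ, while business stays on its own as busi.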
