Why isn't stemDocument stemming?

Question

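I'm using the 'tm' package in R to create a term-document matrix from stemmed terms. The process runs to completion, but the resulting matrix contains terms that don't appear to have been stemmed. I'm trying to understand why that happens and how to fix it.

Here is the script for the process, which uses a couple of online news stories as a sandbox: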

library(boilerpipeR)
library(RCurl)
library(tm)

# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))

# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stripWhitespace = TRUE,
                 stemDocument = TRUE))

# Now inspect the result
findFreqTerms(news.tdm, 4)

Here is the output from that last line:

[1] "acadine"       "adobe"         "android"       "browser"       "challenge"     "companies"     "company"       "devices"       "firefox"       "flash"        
[11] "funding"       "gong"          "hackers"       "international" "ios"           "like"          "million"       "mobile"        "mozilla"       "mozillas"     
[21] "new"           "online"        "operating"     "said"          "security"      "smartphones"   "software"      "startup"       "system"        "systems"      
[31] "tsinghua"      "unigroup"      "used"          "users"         "videos"        "web"           "will"

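For example, in the first row we see both "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and that it would trim the "s" from plurals like "devices". Am I wrong about that? If not, why isn't this code producing the expected result?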

Answer 1

Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter.)
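
As a rough illustration of that change, here is a minimal sketch that swaps the control option and runs on two short made-up sentences instead of the fetched news stories. The `docs` text and the `toy.corpus` / `toy.tdm` names are placeholders invented for this example, and it assumes the SnowballC package is installed, since tm relies on it for stemming. Only the options relevant to the stemming question are kept.

```r
library(tm)  # tm's stemming relies on the SnowballC package being installed

# Two short made-up documents standing in for the extracted news stories
docs <- c("Companies ship devices. The company updated its devices.",
          "Operating systems and the operating system of a company.")
toy.corpus <- VCorpus(VectorSource(docs))

# 'stemming' (not 'stemDocument') is the control option listed in ?termFreq
toy.tdm <- TermDocumentMatrix(toy.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE))

# With stemming applied, "companies" and "company" are expected to
# collapse into a single stemmed term rather than appear separately
Terms(toy.tdm)
```

Applying the same one-word change to the original script (stemming = TRUE in place of stemDocument = TRUE) should give stemmed terms in the findFreqTerms output as well.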

nlp

r

text-mining

tm