Why isn't stemDocument stemming?
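I am using the 'tm' package in R to create a term-document matrix of stemmed terms. The process completes, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it.
Here is the script for the process, which uses a couple of online news stories as a sandbox: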
library(boilerpipeR)
library(RCurl)
library(tm)
# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))
# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
                               control = list(removePunctuation = TRUE,
                                              stopwords = TRUE,
                                              stripWhitespace = TRUE,
                                              stemDocument = TRUE))
# Now inspect the result
findFreqTerms(news.tdm, 4)
Here is the output that last line produces:
[1] "acadine" "adobe" "android" "browser" "challenge" "companies" "company" "devices" "firefox" "flash"
[11] "funding" "gong" "hackers" "international" "ios" "like" "million" "mobile" "mozilla" "mozillas"
[21] "new" "online" "operating" "said" "security" "smartphones" "software" "startup" "system" "systems"
[31] "tsinghua" "unigroup" "used" "users" "videos" "web" "will"
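For example, in the first line of the output we see both "companies" and "company", and we also see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and that it would trim the "s" from plurals like "devices". Am I wrong about that? If not, why isn't this code producing the desired result?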
Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter.)
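To make the fix concrete, here is a minimal sketch of the corrected call. It makes some assumptions that are not in the question: the two short strings in docs are made-up stand-ins for the scraped news stories (so the example runs without network access), and the SnowballC package is assumed to be installed, since tm relies on it for stemming. The stripWhitespace entry is left out because, as far as I can tell, ?termFreq does not list it as a control option either.

```r
library(tm)  # stemming additionally needs the SnowballC package installed

# Hypothetical stand-in text, used instead of the scraped news stories
docs <- c("The company sells devices. Other companies sell devices too.",
          "Each company builds systems; one system runs on many devices.")
corpus <- VCorpus(VectorSource(docs))

# 'stemming' is the control option that termFreq() recognises;
# 'stemDocument = TRUE' is not a documented option and has no effect
tdm <- TermDocumentMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE,
                                         stemming = TRUE))

# Collect the terms that ended up in the matrix
stemmed.terms <- Terms(tdm)
print(stemmed.terms)
```

With stemming switched on, "company" and "companies" should collapse into a single stem (the "compani" guessed in the question), and the plural "devices" should no longer appear as its own term.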