Word frequency not accurate after text stemming
Thanks for taking the time to read my post. Newbie here; this is my first R script, with some sample data.
library(tm)
library(hunspell)
library(stringr)
docs <- VCorpus(VectorSource('He is a nice player, She could be a better player. Playing basketball is fun. Well played! We could have played better. Wish we had better players!'))
input <- strsplit(as.character(docs), " ")
input <- unlist(input)
input <- hunspell_stem(input)
input <- word(input,-1)
input <- VCorpus(VectorSource(input))
docs <- input
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)
This returns the following:
character0 48 better 3 play 3 basketball 1 description 1
fun 1 head 1 hour 1 language 1 meta 1 min 1 nice 1
origin 1 well 1 wish 1 year 1
Expected result:
better 3 play 3 basketball 1 fun 1 language 1 nice 1 well 1
wish 1
I'm not sure where these terms come from (character0, description, meta, language, etc.). Is there a way to get rid of them?
Basically, what I'm trying to do is apply stemming with hunspell to a corpus (the data source is a SQL Server table) and then display the terms in a word cloud. Any help would be appreciated. GD
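One likely source of those extra terms (a sketch, judging from the output above): calling `as.character()` on the whole `VCorpus` object deparses the corpus itself, metadata fields and all, so pieces like `meta`, `language`, `origin`, `description`, and the timestamp components (`hour`, `min`, `year`) end up as tokens. Extracting the document text with `content()` sidesteps that:

```r
library(tm)

docs <- VCorpus(VectorSource('Well played!'))

# as.character() on the whole corpus deparses the corpus object,
# including its metadata fields -- this is where tokens like
# "meta", "language", "origin" come from
head(as.character(docs))

# pulling the text out of a document directly avoids the metadata
content(docs[[1]])
```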
Here is why the example from your comment fails:
library(tm)
library(hunspell)
hunspell_stem(strsplit('Thanks lukeA for your help!', "\\W")[[1]])
# [[1]]
# [1] "thank"
#
# [[2]]
# character(0)
#
# [[3]]
# [1] "for"
#
# [[4]]
# [1] "your"
#
# [[5]]
# [1] "help"
And here is one way to make it work:
docs <- VCorpus(VectorSource('Thanks lukeA for your help!'))
myStem <- function(x) {
  res <- hunspell_stem(x)
  idx <- which(lengths(res) == 0)
  if (length(idx) > 0)
    res[idx] <- x[idx]
  sapply(res, tail, 1)
}
dtm <- TermDocumentMatrix(docs, control = list(stemming = myStem))
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)
# for help! lukea thank your
# 1 1 1 1 1
This returns the original token if there is no stem, and the last stem if there is more than one.
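The fallback can be checked on its own; `lukeA` here is just a stand-in for any token the hunspell dictionary does not recognize:

```r
library(hunspell)

# same helper as above: keep the original token when hunspell
# returns no stem, otherwise keep the last stem
myStem <- function(x) {
  res <- hunspell_stem(x)
  idx <- which(lengths(res) == 0)
  if (length(idx) > 0)
    res[idx] <- x[idx]
  sapply(res, tail, 1)
}

# "lukeA" is passed through unchanged; "thanks" stems to "thank"
myStem(c("lukeA", "thanks"))
```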