Stemming Words in R: Missing Value
I am trying to run sentiment analysis on tweets. While preprocessing the words and creating the matrix, I get the following error:
Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") :
missing value where TRUE/FALSE needed
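Out of 14215 tweets, I narrowed it down to the one tweet that produces the error, but I don't know how to prevent it from happening again.
The tweet that causes the error, with code to reproduce it: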
library(RTextTools)
tweet<-"demonio leg edge sexy we get it u vape PLEASE COME TO NA SOON I HAVE A LUCIEL READY FOR U dominos"
all_tweets= create_matrix(tweet, language="english", minWordLength = 3,
removeStopwords=TRUE, removeNumbers=TRUE, # we can also removeSparseTerms
stemWords=TRUE,removePunctuation = TRUE,removeSparseTerms = 0)
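First I would like to understand the error and why it happens. Then I would like a way to prevent it, either by finding and removing tweets like this one or by changing my create_matrix call so that it handles them.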
The error comes from executing
wordStem(
c("demonio", "leg", "edge", "sexy",
"get", "u", "vape", "please",
"come", NA, "soon", "luciel",
"ready", "u", "dominos")
)
# Error in if (any(lens > lim)) stop("There is a limit of ", lim, "characters on the number of characters in a word being stemmed") :
# missing value where TRUE/FALSE needed
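This may be a bug. The string "NA" seems to be tokenized as NA (a missing value), and the length check inside wordStem then has no TRUE/FALSE value to test.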
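To see why a single NA is enough to trigger the message, here is a minimal base R sketch of the same kind of check. It does not call wordStem itself: the vector `words` and the limit `lim = 255` are placeholders chosen for illustration, not values taken from the package. It also assumes R 3.3.0 or later, where nchar() returns NA for a missing string.

```r
# Base R only; `lim` is a placeholder, not the stemmer's real limit.
words <- c("come", NA, "soon")

lens <- nchar(words)        # the NA element has length NA (R >= 3.3.0)
lim <- 255                  # hypothetical limit, for illustration only
check <- any(lens > lim)    # FALSE, NA, FALSE collapses to NA, not FALSE

# if (NA) is an error in R, so capture the message instead of stopping.
result <- tryCatch(
  if (check) "too long" else "ok",
  error = function(e) conditionMessage(e)
)

print(check)
print(result)
```

Because no word exceeds the limit, any() cannot return TRUE, and because one comparison is NA it cannot return FALSE either, so the if() condition is undefined.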
As a workaround, use
library(tm)
all_tweets <- DocumentTermMatrix(
Corpus(VectorSource(tweet)),
control = list(
wordLengths = c(3, Inf),
stopwords=TRUE,
removeNumbers=TRUE,
stemming=TRUE,
removePunctuation = TRUE
)
)
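If you would rather find the affected tweets before building the matrix, something along these lines might help. It is only a sketch: it assumes the problem is triggered by "NA" appearing as a standalone word (in any letter case), which is my reading of the behaviour above rather than something confirmed in the package source. The example tweets and the names `na_pattern`, `kept_tweets` and `stripped_tweets` are made up for illustration.

```r
# Hypothetical example data; only the first tweet has "NA" as its own word.
tweets <- c(
  "PLEASE COME TO NA SOON",
  "we get it u vape",
  "banana nap nation"
)

# \b marks a word boundary, so "banana" and "nap" are not matched.
na_pattern <- "\\bNA\\b"
has_na_token <- grepl(na_pattern, tweets, ignore.case = TRUE, perl = TRUE)

# Option 1: drop the affected tweets entirely.
kept_tweets <- tweets[!has_na_token]

# Option 2: keep every tweet and delete only the standalone token.
stripped_tweets <- gsub(na_pattern, "", tweets, ignore.case = TRUE, perl = TRUE)

print(has_na_token)
print(kept_tweets)
```

Option 2 loses less data, since only the one token goes away. Either result can then be passed on in place of the original vector, though I have not verified that this avoids the error in every case.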
My sessionInfo():
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RTextTools_1.4.2 SparseM_1.7
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 splines_3.3.0 MASS_7.3-44 tau_0.0-18 prodlim_1.5.5 tm_0.6-2
[7] lattice_0.20-33 foreach_1.4.3 caTools_1.17.1 tools_3.3.0 nnet_7.3-11 parallel_3.3.0
[13] grid_3.3.0 ipred_0.9-5 glmnet_2.0-5 e1071_1.6-7 iterators_1.0.8 class_7.3-14
[19] survival_2.39-4 randomForest_4.6-12 Matrix_1.2-6 NLP_0.1-9 lava_1.4.3 bitops_1.0-6
[25] codetools_0.2-14 rsconnect_0.4.3 maxent_1.3.3.1 rpart_4.1-10 slam_0.1-32 tree_1.0-36