R tm package upgrade - error converting corpus to data frame
Something seems to have broken with the latest tm upgrade. My code is below, with test data -
library(tm)

data = c('Lorem ipsum dolor sit amet, consectetur adipiscing elit',
         'Vestibulum posuere nisl vel lobortis vulputate',
         'Quisque eget sem in felis egestas sagittis')
ccorpus_clean = Corpus(VectorSource(data))
ccorpus_clean = tm_map(ccorpus_clean, removePunctuation, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, stripWhitespace, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, tolower, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeNumbers, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, stemDocument, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeWords, stopwords("english"), lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeWords, c("hi"), lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, removeWords, c("account", "can"), lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, PlainTextDocument, lazy = TRUE)
ccorpus_clean = tm_map(ccorpus_clean, stripWhitespace, lazy = TRUE)
ccorpus_clean
df = data.frame(text = unlist(sapply(ccorpus_clean, `[[`, "content")), stringsAsFactors = FALSE)
Everything used to work fine, but suddenly I had to add `, lazy = TRUE` - otherwise the corpus transformations stopped working. The lazy issue is documented here - R tm In mclapply(content(x), FUN, ...) : all scheduled cores encountered errors in user code

With lazy, the transformations work, but converting the corpus back to a data frame now fails with the following error -
ccorpus_clean = tm_map(ccorpus_clean,stripWhitespace,lazy=TRUE)
ccorpus_clean
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 5
df = data.frame(text=unlist(sapply(ccorpus_clean , `[[`, "content")), stringsAsFactors=FALSE)
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning message:
In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
all scheduled cores encountered errors in user code
Edit - this also fails:
data.frame(text = sapply(ccorpus_clean, as.character), stringsAsFactors = FALSE)
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
R version - 3.2.3 (2015-12-10) / tm - 0.6-2
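For what it's worth, a commonly reported cause of this `mclapply` error in tm >= 0.6 is that base-R functions such as `tolower` return bare character vectors rather than text documents. A minimal sketch of the same pipeline (an assumption about the cause, not a confirmed fix) wraps such functions in tm's `content_transformer()` so that `tm_map()` keeps producing proper documents, with no `lazy = TRUE` needed:

```r
library(tm)

data <- c("Lorem ipsum dolor sit amet, consectetur adipiscing elit",
          "Vestibulum posuere nisl vel lobortis vulputate",
          "Quisque eget sem in felis egestas sagittis")

corp <- Corpus(VectorSource(data))
# content_transformer() wraps a plain character->character function so
# that tm_map() returns TextDocument objects, not bare character vectors
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# as.character() on each document recovers its text content
df <- data.frame(text = sapply(corp, as.character), stringsAsFactors = FALSE)
df
```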
That looks complicated. How about:
data <- c("Lorem ipsum dolor sit amet account: 999 red balloons.",
"Some English words are just made for stemming!")
require(quanteda)
# makes the texts into a list of tokens with the same treatment
# as your tm mapped functions
toks <- tokenize(toLower(data), removePunct = TRUE, removeNumbers = TRUE)
# toks is just a named list
toks
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "lorem" "ipsum" "dolor" "sit" "amet" "account" "red" "balloons"
##
## Component 2 :
## [1] "some" "english" "words" "are" "just" "made" "for" "stemming"
# remove selected terms
toks <- removeFeatures(toks, c(stopwords("english"), "hi", "account", "can"))
# apply stemming
toks <- wordstem(toks)
# make into a data frame by reassembling the cleaned tokens
(df <- data.frame(text = sapply(toks, paste, collapse = " ")))
## text
## 1 lorem ipsum dolor sit amet red balloon
## 2 english word just made stem
I ran into a similar problem, and it doesn't seem to be caused by upgrading the tm package. If you don't want to use quanteda, an alternative solution is to set the number of cores to 1 (instead of using lazy = TRUE). I don't know why, but it worked for me.
corpus = tm_map(corpus, tolower, mc.cores = 1)
If you want to diagnose whether this problem is caused by parallel processing, try running this line:
getOption("mc.cores", 2L)
If it returns 2 cores, then setting the number of cores to 1 should fix the problem. See this answer for a detailed explanation.
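To make this check and fix concrete, here is a short sketch. It assumes tm's parallel helpers ultimately fall back to `parallel::mclapply()`, whose default is `getOption("mc.cores", 2L)`, so setting the session-wide option avoids passing `mc.cores = 1` to every `tm_map()` call:

```r
# Report the default core count used by mclapply-based helpers;
# 2L is the fallback when the "mc.cores" option is unset
getOption("mc.cores", 2L)

# Force serial execution for the whole session
options(mc.cores = 1)

# The same check now reports a single core
getOption("mc.cores", 2L)
```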