Using the VCorpus() function but losing content

Question

I am using the VCorpus() function from the R package tm. Here is the problem I ran into:

example_text = data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))

This looks like:

  num                             Author1               Author2
1   1        Text mining is a great time. R is a great language
2   2     Text analysis provides insights       R has many uses
3   3 qdap and tm are used in text mining     DataCamp is cool!

Then I ran df_source = DataframeSource(example_text[,2:3]) to extract only the last 2 columns.

df_source looks correct. After that, I ran df_corpus = VCorpus(df_source), and df_corpus[[1]] is

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 2

and df_corpus[[1]][1] gives me

$content
[1] "3" "3"

But df_corpus[[1]] should return

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

and df_corpus[[1]][1] should return

$content
[1] "Text mining is a great time." "R is a great language"

I don't know what went wrong. Any suggestions would be greatly appreciated.

Answer 1

The text in example_text that should be character has been converted to factors, because the 'factory-fresh' value of stringsAsFactors is TRUE, which is odd and annoying in my opinion.

example_text <- data.frame(num=c(1,2,3),Author1 = c("Text mining is a great time.","Text analysis provides insights","qdap and tm are used in text mining"),Author2=c("R is a great language","R has many uses","DataCamp is cool!"))
lapply(example_text, class)

# $num
# [1] "numeric"
# 
# $Author1
# [1] "factor"
# 
# $Author2
# [1] "factor"
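
A minimal base-R sketch of how factor columns can end up as numbers like "3". It assumes the row gets coerced with as.character() somewhere along the way, which is an assumption about the mechanism rather than something confirmed from the tm source. The name with_factors is made up for illustration, and stringsAsFactors = TRUE is passed explicitly because the default changed to FALSE in R 4.0.0.

```r
# stringsAsFactors = TRUE is set explicitly so the sketch behaves the same
# on R >= 4.0.0, where the default became FALSE.
with_factors <- data.frame(
  Author1 = c("Text mining is a great time.", "Text analysis provides insights"),
  Author2 = c("R is a great language", "R has many uses"),
  stringsAsFactors = TRUE
)

# Coercing a one-row data frame of factors yields the integer level codes,
# not the labels.
row_as_codes <- as.character(with_factors[1, ])

# Converting each column individually yields the actual text.
row_as_text <- vapply(with_factors[1, ], as.character, character(1))

print(row_as_codes)
print(row_as_text)
```

The exact codes depend on how the factor levels are ordered, so they may differ from the "3" "3" seen in the question.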

To make sure the Author1 and Author2 columns are character columns, you can try any one of the following:

Add options(stringsAsFactors = FALSE) at the start of your code.
Add stringsAsFactors = FALSE to your data.frame(...) call.
Run example_text[, 2:3] <- lapply(example_text[, 2:3], as.character)
Run example_text[, 2:3] <- lapply(example_text[, 2:3], paste)

Then everything should work.
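
As a quick check, here is a small sketch of two of those options using only base R. The names texts1, texts2, fixed_at_creation and converted are hypothetical, and stringsAsFactors = TRUE is passed explicitly in the second case to reproduce the factor columns on any R version.

```r
texts1 <- c("Text mining is a great time.",
            "Text analysis provides insights",
            "qdap and tm are used in text mining")
texts2 <- c("R is a great language",
            "R has many uses",
            "DataCamp is cool!")

# Option: tell data.frame() to keep strings as character.
fixed_at_creation <- data.frame(num = c(1, 2, 3),
                                Author1 = texts1,
                                Author2 = texts2,
                                stringsAsFactors = FALSE)

# Option: start from factor columns and convert them afterwards.
converted <- data.frame(num = c(1, 2, 3),
                        Author1 = texts1,
                        Author2 = texts2,
                        stringsAsFactors = TRUE)
converted[, 2:3] <- lapply(converted[, 2:3], as.character)

# Both data frames should now report "character" for the Author columns.
print(sapply(fixed_at_creation, class))
print(sapply(converted, class))
```

Converting after the fact is handy when the data frame comes from somewhere you do not control, while setting the argument at creation avoids the round trip through factors altogether.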

r

tm