Word columns from the text in a data frame column, with their frequencies, in R

Question

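I have a question about this old post: R Text mining - how to change texts in R data frame column into several columns with word frequencies?

I am trying to reproduce almost exactly what was posted at the link above, using R, except that my strings contain numeric characters.

Suppose res is the data frame I have defined: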

library(qdap)
x1 <- as.factor(c( "7317 test1 fool 4258 6287" , "thi1s is 6287 test funny text1 test1", "this is test1 6287 text1 funny fool"))
y1 <- as.factor(c("test2 6287", "this is test text2", "test2 6287"))
z1 <- as.factor(c( "test2 6287" , "this is test 4258 text2 fool", "test2 6287"))
res <- data.frame(x1, y1, z1)

When I compute the word frequencies with these commands,

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE))
abcd <- data.frame(res, freqs, check.names = FALSE)

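abcd ignores 7317, 4258, 6287, and even the digit 1 in test1 when it computes the frequencies.

In the first row of column x1, the 1 is stripped from test1 and the remainder is counted as a word. Similarly, is is split off from thi1s and counted as a word. What I want, however, is test1. Likewise, the numbers 7317, 4258, etc., which are stored as strings, must be counted as words and appear in the data table with their frequencies. What else needs to be added to the code?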

Answer 1

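You need to add the following to the freqs statement: removeNumbers = FALSE. The wfm function calls several other functions, one of which is tm::TermDocumentMatrix. The default value that wfm passes to that function is removeNumbers = TRUE, so it needs to be set to FALSE.

Code: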

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE, removeNumbers = FALSE))
abcd <- data.frame(res, freqs, check.names = FALSE)
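
If you want to sanity-check the counts without relying on qdap, a minimal base R sketch along the following lines may help. It is not the wfm approach above, just an independent cross-check. It assumes that words are separated by whitespace only and that lower-casing is acceptable. The names tokens, vocab, freqs_base and result are made up for illustration.

```r
# Illustrative cross-check in base R; this does not use qdap or tm.
x1 <- c("7317 test1 fool 4258 6287",
        "thi1s is 6287 test funny text1 test1",
        "this is test1 6287 text1 funny fool")

# Split on whitespace only, so digits stay inside their tokens
# (assumption: no punctuation needs to be handled).
tokens <- strsplit(trimws(tolower(x1)), "\\s+")

# Vocabulary: every distinct token across all rows, sorted for stable columns.
vocab <- sort(unique(unlist(tokens)))

# One row per text, one column per token; each cell holds a count.
freqs_base <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(freqs_base) <- seq_along(x1)

# check.names = FALSE keeps column names such as "7317" unchanged.
result <- data.frame(x1 = x1, freqs_base, check.names = FALSE)
```

Here test1, thi1s and 7317 each get their own column, which you can compare against the freqs output from wfm.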
