Using lapply on term document matrix to calculate word frequency

Question

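Given three TermDocumentMatrix objects built from text1, text2 and text3, I'd like to compute the word frequency of each one into a data frame and then rbind all the data frames. Three is just a sample; I actually have hundreds, so I need to turn this into a function.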

It's easy to calculate the word frequency for one TDM:

apply(x, 1, sum)

or

rowSums(as.matrix(x))
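
A minimal sketch of why the two calls above agree. It assumes a small dense matrix `m` (a made-up stand-in for `as.matrix(x)`, not real TDM output), with terms as rows and documents as columns:

```r
# Hypothetical stand-in for as.matrix(x): terms are rows, documents are columns.
m <- matrix(
  c(1, 0, 2,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Terms = c("apple", "cool"), Docs = c("1", "2", "3"))
)

freq_apply   <- apply(m, 1, sum)  # sum over each row, i.e. per term
freq_rowsums <- rowSums(m)        # same totals, computed in one vectorised call

print(freq_rowsums)
```

Both give one total per term, named by the term. For a large sparse TDM, `slam::row_sums(x)` is an option that avoids building the dense matrix first.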

I want to make a list of the TDMs:

tdm_list <- Filter(function(x) is(x, "TermDocumentMatrix"), mget(ls()))

and calculate the frequency of each word and put it in a data frame:

data.frame(lapply(tdm_list, sum)) # this is wrong. it simply sums frequency of all words instead of frequency by each word.

and then rbind everything:

do.call(rbind, df_list)

I can't figure out how to use lapply on a TDM to calculate the word frequency.

Adding sample data to play around with:

require(tm)
text1 <- c("apple" , "love", "crazy", "peaches", "cool", "coke", "batman", "joker")
text2 <- c("omg", "#rstats" , "crazy", "cool", "bananas", "functions", "apple")
text3 <- c("Playing", "rstats", "football", "data", "coke", "caffeine", "peaches", "cool")

tdm1 <- TermDocumentMatrix(Corpus(VectorSource(text1)))
tdm2 <- TermDocumentMatrix(Corpus(VectorSource(text2)))
tdm3 <- TermDocumentMatrix(Corpus(VectorSource(text3)))

Answer 1

OK, I think I've got it, and this might actually help others who want to do the same thing. It turned out to be simple.

combineddf <- do.call(rbind, lapply(tdm_list, function (x) {
 data.frame(apply(x, 1, sum))
}))

The above takes a list of TermDocumentMatrix objects, computes the word counts of each matrix as a data frame, and rbinds them all together.
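
One possible refinement, sketched under assumptions: the rbind above leaves the term and the source matrix encoded only in the row names, and the column gets an auto-generated name. The helper `term_freq_df` below is hypothetical (not part of tm), and the two small matrices are made-up stand-ins for TDMs so the example runs without any packages:

```r
# Hypothetical helper: build one data frame with explicit source/term/freq columns.
term_freq_df <- function(mat_list) {
  per_matrix <- lapply(names(mat_list), function(nm) {
    freq <- rowSums(as.matrix(mat_list[[nm]]))  # per-term totals for this matrix
    data.frame(source = nm,
               term   = names(freq),
               freq   = unname(freq),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_matrix)
}

# Made-up term-by-document matrices standing in for TDMs (assumption).
m1 <- matrix(c(1, 1,
               0, 2),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("apple", "cool"), c("1", "2")))
m2 <- matrix(c(3, 0,
               1, 1,
               0, 1),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("cool", "crazy", "omg"), c("1", "2")))

combined <- term_freq_df(list(tdm1 = m1, tdm2 = m2))
print(combined)
```

With tm loaded, passing `tdm_list` to the same helper should behave the same way, since it only relies on `as.matrix()` and `rowSums()`. If totals across all matrices are wanted, `aggregate(freq ~ term, data = combined, FUN = sum)` is one way to get them.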
