Term frequency table to DocumentTermMatrix in tm R package

Question

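I'm using the tm package in R to do some text mining. I have a term-frequency matrix in which each row is a document, each column is a word, and each cell is the frequency of that word in that document. I'm trying to convert it to a DocumentTermMatrix object. I can't seem to find a function that handles this; the source usually seems to be the documents themselves.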

I tried as.DocumentTermMatrix(), but it asks for the argument "weighting" and gives the following error:

Error in .TermDocumentMatrix(t(x), weighting) :
argument "weighting" is missing, with no default

Here is the code for a simple reproducible example:

docs = matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) = paste0("doc", 1:10)
colnames(docs) = c("grad", "school", "is", "sleep", "deprivation")

So I need to coerce the matrix docs into a DocumentTermMatrix.

Answer 1

Using your example code, you can do the following:

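library(tm)

# Term-frequency matrix: one row per document, one column per term
docs = matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) = paste0("doc", 1:10)
colnames(docs) = c("grad", "school", "is", "sleep", "deprivation")

# Coerce the matrix; the weighting argument must be supplied explicitly here
dtm <- as.DocumentTermMatrix(docs, weighting = weightTfIdf)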
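To confirm that the coercion did what was intended, the resulting object can be inspected. The following is only a minimal sketch: it assumes tm is installed, and the set.seed() call is an addition that is not in the original example (it just makes the random counts repeatable).

```r
library(tm)

set.seed(1)  # assumption: fixed seed, only so the sketch is repeatable
docs <- matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) <- paste0("doc", 1:10)
colnames(docs) <- c("grad", "school", "is", "sleep", "deprivation")

dtm <- as.DocumentTermMatrix(docs, weighting = weightTfIdf)

class(dtm)   # the class vector should include "DocumentTermMatrix"
dim(dtm)     # number of documents, then number of terms
Docs(dtm)    # document names, taken from the row names of docs
Terms(dtm)   # terms, taken from the column names of docs
```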

If you read the help for DocumentTermMatrix, you will see the following under the arguments:

weighting: A weighting function capable of handling a TermDocumentMatrix. It defaults to weightTf for term frequency weighting. Available weighting functions shipped with the tm package are weightTf, weightTfIdf, weightBin, and weightSMART.
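Of the functions listed there, weightTf is the one that leaves the counts as they are, which is probably what you want if the matrix already holds raw frequencies. weightTfIdf replaces the counts with rescaled weights, and a term that occurs in every document (as every term does in this toy matrix, since every cell is at least 1) should get an inverse document frequency of log2(1) = 0. A small sketch comparing weightTf and weightBin, assuming tm is installed; the set.seed() call and the names dtm_tf and dtm_bin are mine:

```r
library(tm)

set.seed(1)  # assumption: fixed seed, only so the sketch is repeatable
docs <- matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) <- paste0("doc", 1:10)
colnames(docs) <- c("grad", "school", "is", "sleep", "deprivation")

# Term-frequency weighting: the cell values stay as raw counts
dtm_tf <- as.DocumentTermMatrix(docs, weighting = weightTf)

# Binary weighting: every non-zero count is replaced by 1
dtm_bin <- as.DocumentTermMatrix(docs, weighting = weightBin)

# Converting back to a dense matrix lets you compare against the input
all(as.matrix(dtm_tf) == docs)
range(as.matrix(dtm_bin))
```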

Depending on your needs, you have to specify which weighting function to use for the document-term matrix, or write your own.
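As a rough illustration of writing your own, tm provides WeightFunction() for wrapping a function together with a name and an acronym. The log(1 + tf) weighting below, and the name weightLogTf, are made up for this sketch and are not part of tm; the set.seed() call is also my addition.

```r
library(tm)

set.seed(1)  # assumption: fixed seed, only so the sketch is repeatable
docs <- matrix(sample(1:10, 50, replace = TRUE), byrow = TRUE, ncol = 5, nrow = 10)
rownames(docs) <- paste0("doc", 1:10)
colnames(docs) <- c("grad", "school", "is", "sleep", "deprivation")

# Hypothetical custom weighting: log(1 + term frequency)
weightLogTf <- WeightFunction(function(m) {
  m$v <- log1p(m$v)  # transform only the stored (non-zero) counts
  attr(m, "weighting") <- c("log term frequency", "logtf")
  m
}, "log term frequency", "logtf")

dtm_log <- as.DocumentTermMatrix(docs, weighting = weightLogTf)

# The dense form should equal log1p() applied to the original counts
all.equal(as.matrix(dtm_log), log1p(docs), check.attributes = FALSE)
```

Because log1p(0) is 0, this particular transformation leaves empty cells empty, so the sparse representation stays valid.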

Tags: r, text-mining, tm, word-frequency