How to Cast a Dataframe into a DTM

Question

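I want to convert my table into a document-term matrix (DTM) and keep the metadata.

Each row should be a document. But cast_dtm() needs a count variable: to "cast", the data has to be in "Document, Term, Count" format.

How do I convert my data into a "Document, Term, Count" data frame? From there, it is easy to cast to a DTM and do what I need.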

Answer 1

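Try this, passing the text column (df$column) rather than the whole df:

library(tm)
# `column` stands for whichever column of df holds the text
myCorpus <- Corpus(VectorSource(df$column))
dtm <- DocumentTermMatrix(myCorpus)

I have used the above code for a text mining project.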
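For the "Document, Term, Count" shape the question asks about, here is a minimal sketch of one possible route. It assumes the dplyr, tidytext, and tm packages are installed. The data frame, its column names, and the `doc_id` column are made up for illustration and are not taken from the asker's data.

```r
library(dplyr)
library(tidytext)

# Illustrative data only: one row per document, text in `Comments`
df <- data.frame(
  Date      = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
  Group     = "Cars",
  Reporting = c(rep("A", 3), "B"),
  Comments  = c(rep("This car is awesome", 3), "No comments"),
  stringsAsFactors = FALSE
)

# Assumption: rows have no natural id, so build one per row
df$doc_id <- paste0("text", seq_len(nrow(df)))

# Document, Term, Count: one row per (document, word) pair with its count
counts <- df %>%
  unnest_tokens(word, Comments) %>%   # lowercases and splits into words
  count(doc_id, word, name = "n")

# Cast the tidy counts into a tm DocumentTermMatrix
dtm <- counts %>% cast_dtm(doc_id, word, n)

# The DTM itself carries no metadata, so keep a table aligned to its rows
meta <- df[match(tm::Docs(dtm), df$doc_id),
           c("doc_id", "Date", "Group", "Reporting")]
```

Keeping `meta` in the same row order as the DTM is one simple way to carry the metadata along; joining on `doc_id` later would work as well.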

Answer 2

You can also use the quanteda package.

Recreating your data.frame:

df <- data.frame(Date = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
                 Group = "Cars",
                 Reporting = c(rep("A", 3), "B"),
                 Comments = c(rep("This car is awesome", 3), "No comments"),
                 stringsAsFactors = FALSE)
df
#         Date Group Reporting            Comments
# 1 2015-01-01  Cars         A This car is awesome
# 2 2015-01-01  Cars         A This car is awesome
# 3 2015-01-03  Cars         A This car is awesome
# 4 2015-01-01  Cars         B         No comments

The short way to a dfm (quanteda's document-feature matrix):

dfm(df$Comments)
# Document-feature matrix of: 4 documents, 6 features (41.7% sparse).
# 4 x 6 sparse Matrix of class "dfmSparse"
#        features
# docs    this car is awesome no comments
#   text1    1   1  1       1  0        0
#   text2    1   1  1       1  0        0
#   text3    1   1  1       1  0        0
#   text4    0   0  0       0  1        1

The long way to a dfm:

Make a corpus, including the document variables:

require(quanteda)
myCorpus <- corpus(df, text_field = "Comments")
summary(myCorpus)
# Corpus consisting of 4 documents.
# 
#  Text Types Tokens Sentences       Date Group Reporting
# text1     4      4         1 2015-01-01  Cars         A
# text2     4      4         1 2015-01-01  Cars         A
# text3     4      4         1 2015-01-03  Cars         A
# text4     2      2         1 2015-01-01  Cars         B
# 
# Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Wed Jun 21 23:34:35 2017
# Notes:

Then:

dfm(myCorpus)
# Document-feature matrix of: 4 documents, 6 features (41.7% sparse).
# 4 x 6 sparse Matrix of class "dfmSparse"
#        features
# docs    this car is awesome no comments
#   text1    1   1  1       1  0        0
#   text2    1   1  1       1  0        0
#   text3    1   1  1       1  0        0
#   text4    0   0  0       0  1        1
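
The output above comes from a 2017 release of quanteda. As far as I know, quanteda 3 and later expect a tokens object as input to dfm(), so here is a hedged sketch of the same long way with an explicit tokens() step. It assumes a recent quanteda is installed and rebuilds the example data frame so that it runs on its own.

```r
library(quanteda)

# Same example data as above
df <- data.frame(
  Date      = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
  Group     = "Cars",
  Reporting = c(rep("A", 3), "B"),
  Comments  = c(rep("This car is awesome", 3), "No comments"),
  stringsAsFactors = FALSE
)

# Corpus with the non-text columns kept as document variables
corp <- corpus(df, text_field = "Comments")

# Tokenize first, then build the document-feature matrix
toks  <- tokens(corp)
dfmat <- dfm(toks)   # lowercases features by default

# The metadata travels with the matrix as document variables
docvars(dfmat)
```

If a tm DocumentTermMatrix is needed rather than a dfm, quanteda's convert(dfmat, to = "tm") should produce one, provided the tm package is installed.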
