How to Cast a Dataframe into a DTM
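I want to convert my table into a DTM and keep the metadata.
Each row should be one document. But cast_dtm() needs a count variable: to "cast", the data has to be in "Document, Term, Count" format.
How do I get my data into a "Document, Term, Count" data frame? From there it is easy to cast to a DTM and do what I need.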
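One possible route to the "Document, Term, Count" shape is the tidytext package together with dplyr. The sketch below is only an illustration under assumptions: it uses the four-row example data that appears in the answers, and doc_id is a column name invented here to identify each row as a document.

```r
library(dplyr)
library(tidytext)

# Example data (same four rows as in the answers)
df <- data.frame(
  Date      = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
  Group     = "Cars",
  Reporting = c(rep("A", 3), "B"),
  Comments  = c(rep("This car is awesome", 3), "No comments"),
  stringsAsFactors = FALSE
)

# Hypothetical document id: one per row, so each row is one document
df$doc_id <- seq_len(nrow(df))

# Split Comments into lowercased words, then count words per document.
# The result has the columns doc_id (Document), word (Term), n (Count).
counts <- df %>%
  unnest_tokens(word, Comments) %>%
  count(doc_id, word)

# Cast the long table into a tm DocumentTermMatrix
dtm <- cast_dtm(counts, document = doc_id, term = word, value = n)
```

A DocumentTermMatrix has no per-document columns of its own, so in this sketch Date, Group and Reporting stay in df and can be matched back to the matrix rows through doc_id.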
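Try this:
library(tm)
# Pass the text column, not the whole data frame:
# VectorSource(df) would treat each column of df as one document.
myCorpus <- Corpus(VectorSource(df$column))
dtm <- DocumentTermMatrix(myCorpus)
Here df$column stands for the column of df that holds your text. I have used this code in a text mining project.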
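To keep the metadata with tm as well, one option is DataframeSource, which (in tm 0.7-3 and later) expects a doc_id column and a text column and treats every remaining column as document-level metadata. This is a sketch under that version assumption, not the answerer's code; the names docs and corp and the example data are chosen for illustration.

```r
library(tm)

# Example data (same four rows as in the other answer)
df <- data.frame(
  Date      = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
  Group     = "Cars",
  Reporting = c(rep("A", 3), "B"),
  Comments  = c(rep("This car is awesome", 3), "No comments"),
  stringsAsFactors = FALSE
)

# DataframeSource needs doc_id first and text second;
# the remaining columns become document-level metadata.
docs <- data.frame(
  doc_id = as.character(seq_len(nrow(df))),
  text   = df$Comments,
  df[, c("Date", "Group", "Reporting")],
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(docs))
dtm  <- DocumentTermMatrix(corp)

# Metadata travels with the corpus, one row per document
meta(corp)
```

Note that DocumentTermMatrix() drops words shorter than three characters by default, so "is" and "no" would not appear as terms unless the wordLengths control option is changed.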
You could also use the quanteda package.
Recreating your data.frame:
df <- data.frame(Date = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
                 Group = "Cars",
                 Reporting = c(rep("A", 3), "B"),
                 Comments = c(rep("This car is awesome", 3), "No comments"),
                 stringsAsFactors = FALSE)
df
# Date Group Reporting Comments
# 1 2015-01-01 Cars A This car is awesome
# 2 2015-01-01 Cars A This car is awesome
# 3 2015-01-03 Cars A This car is awesome
# 4 2015-01-01 Cars B No comments
The shorthand way to get a document-term matrix:
dfm(df$Comments)
# Document-feature matrix of: 4 documents, 6 features (41.7% sparse).
# 4 x 6 sparse Matrix of class "dfmSparse"
# features
# docs this car is awesome no comments
# text1 1 1 1 1 0 0
# text2 1 1 1 1 0 0
# text3 1 1 1 1 0 0
# text4 0 0 0 0 1 1
The longer way to get to a dfm:
Make a corpus, including the document variables:
require(quanteda)
myCorpus <- corpus(df, text_field = "Comments")
summary(myCorpus)
# Corpus consisting of 4 documents.
#
# Text Types Tokens Sentences Date Group Reporting
# text1 4 4 1 2015-01-01 Cars A
# text2 4 4 1 2015-01-01 Cars A
# text3 4 4 1 2015-01-03 Cars A
# text4 2 2 1 2015-01-01 Cars B
#
# Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Wed Jun 21 23:34:35 2017
# Notes:
Then:
dfm(myCorpus)
# Document-feature matrix of: 4 documents, 6 features (41.7% sparse).
# 4 x 6 sparse Matrix of class "dfmSparse"
# features
# docs this car is awesome no comments
# text1 1 1 1 1 0 0
# text2 1 1 1 1 0 0
# text3 1 1 1 1 0 0
# text4 0 0 0 0 1 1
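The output above comes from a 2017 release of quanteda. In quanteda 3 and later, dfm() expects a tokens object rather than a character vector or corpus, so the same steps would probably look like the sketch below. The names corp, toks and dfmat are chosen for illustration and are not from the original answer.

```r
library(quanteda)

# Example data, as recreated above
df <- data.frame(
  Date      = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
  Group     = "Cars",
  Reporting = c(rep("A", 3), "B"),
  Comments  = c(rep("This car is awesome", 3), "No comments"),
  stringsAsFactors = FALSE
)

# Corpus with Comments as the text; other columns become document variables
corp <- corpus(df, text_field = "Comments")

# Tokenize first, then build the document-feature matrix
toks  <- tokens(corp)
dfmat <- dfm(toks)

# The document variables (Date, Group, Reporting) stay attached to the dfm
docvars(dfmat)
```

Because the document variables are carried through to the dfm, the metadata stays aligned with the rows of the matrix. If a tm-style DTM is needed afterwards, quanteda's convert(dfmat, to = "tm") is one way to get it.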