My DocumentTermMatrix reduces to zero columns
train <- read.delim('train.tsv', header= T, fileEncoding= "windows-1252",stringsAsFactors=F)
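train.tsv contains 156,060 lines of text with 4 columns named Phrase, PhraseId, SentenceId and Sentiment (on a scale of 0 to 4). The Phrase column holds the text lines. (The tm package is already loaded.)
R version: 3.1.2; OS: Windows 7, 64-bit, 4 GB RAM.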
> dput(head(train,6))
structure(list(PhraseId = 1:6, SentenceId = c(1L, 1L, 1L, 1L,
1L, 1L), Phrase = c("A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .",
"A series of escapades demonstrating the adage that what is good for the goose",
"A series", "A", "series", "of escapades demonstrating the adage that what is good for the goose"
), Sentiment = c(1L, 2L, 2L, 2L, 2L, 2L)), .Names = c("PhraseId",
"SentenceId", "Phrase", "Sentiment"), row.names = c(NA, 6L), class = "data.frame")
These are the first 6 rows of the train data.
clean_corpus <- function(corpus)
{
mycorpus <- tm_map(corpus, removeWords,stopwords("english"))
mycorpus <- tm_map(mycorpus, removeWords,c("movie","actor","actress"))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus, tolower)
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, PlainTextDocument )
return(mycorpus)
}
# Build DTM
generateDTM <- function(df)
{
m <- list(Sentiment = "Sentiment", Phrase = "Phrase")
myReader <- readTabular(mapping = m)
mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
#Code to attach sentiment label with every text line
for (i in 1:length(mycorpus))
{
attr(mycorpus[[i]], "Sentiment") <- df$Sentiment[i]
}
mycorpus <- clean_corpus(mycorpus)
dtm <- DocumentTermMatrix(mycorpus)
return(dtm)
}
dtm1 <- generateDTM(train)
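Here I have written two functions: one to clean the corpus and one to build the DTM (document-term matrix). I have also attached each sentiment value to its line of text. Now, when I check the dimensions of dtm1, it shows 156060 rows but 0 columns.
So how can I generate a DTM with sentiment labels?
When you set up your reader, you want to map something to the "content" of the document; otherwise it doesn't know which text to use to build the corpus. The other values are stored as metadata. Try changing your code to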
m <- list(Sentiment = "Sentiment", content = "Phrase")
myReader <- readTabular(mapping = m)
mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
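The same idea (one column becomes the document content, the rest become metadata) can be sketched for newer tm releases. As far as I know, from tm 0.7 onward readTabular is no longer available and DataframeSource instead expects columns named doc_id and text, keeping any extra columns as document-level metadata. The helper name build_dtm and the two-row train_small data frame below are made up for illustration, and the cleaning order (lower-casing first, through content_transformer) is my own choice rather than the asker's.

```r
library(tm)

# Toy stand-in for train.tsv: two phrases taken from the sample in the question.
train_small <- data.frame(
  PhraseId  = 1:2,
  Phrase    = c("A series of escapades demonstrating the adage",
                "what is good for the goose"),
  Sentiment = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the DTM together with the sentiment labels,
# both in the same document order.
build_dtm <- function(df) {
  # Assumption: tm >= 0.7, where DataframeSource requires doc_id and text.
  src_df <- data.frame(
    doc_id    = as.character(df$PhraseId),
    text      = df$Phrase,      # this column becomes the document content
    Sentiment = df$Sentiment,   # extra columns are kept as metadata
    stringsAsFactors = FALSE
  )
  corpus <- VCorpus(DataframeSource(src_df))

  # Lower-case first so that stop-word removal also catches "A", "The", ...
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)

  list(dtm = DocumentTermMatrix(corpus),
       sentiment = meta(corpus)$Sentiment)
}

res <- build_dtm(train_small)
dim(res$dtm)     # rows = documents, columns = terms
res$sentiment    # labels aligned with the rows of the DTM
```

Because the labels come back in document order, they can be bound to as.matrix(res$dtm) (or kept as a separate vector) when training a classifier; for the full 156,060-row data the dense conversion may not fit in 4 GB, so keeping the sparse DTM is probably safer.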