Error in aggregate.data.frame(as.data.frame(x), ...) : arguments must have same length
Hello, I'm working through the last example in this tutorial, topic proportions over time:
https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html
I ran it on my own data with this code:
library(readxl)
library(tm)
# Import text data
tweets <- read_xlsx("C:/R/data.xlsx")
textdata <- tweets$text
#Load in the library 'stringr' so we can use the str_replace_all function.
library('stringr')
#Remove URLs
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")
textdata <- gsub("@\\w+", " ", textdata) # Remove user names (all proper names if you're wise!)
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata) # Remove hashtags
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See:
#rowTotals <- apply(dtm, 1, sum) # Find the sum of words in each document
#dtm.new <- dtm[rowTotals > 0, ]
library("ldatuning")
library("topicmodels")
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
#####################################################
#topics by year
tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames
# reshape data frame
library(reshape2)
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here is the Excel file with the input data:
https://www.mediafire.com/file/4w2hkgzzzaaax88/data.xlsx/file
The error occurs when I run the line with the aggregate function, and I can't figure out what is going wrong in aggregate. I created the "decade" variable as in the tutorial, printed it, and it looks fine; the theta variable looks fine too. I changed the aggregate call several times based on this post, but I keep getting the same error:
Error in aggregate.data.frame : arguments must have same length
Please help.
I'm not sure what you want to achieve with the command
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
because as far as I can tell you only produce a single decade:
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
table(tweets$decade)
2010
3481
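As a side note: if all of your documents fall into one decade, grouping by year may be more informative. A minimal sketch, assuming date2 is text that begins with a four-digit year (the year column is my addition):
tweets$year <- substr(tweets$date2, 1, 4) # e.g. "2018"
table(tweets$year) # how many tweets per year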
With all the preprocessing from tweets to textdata you produce some empty lines. This is where your problem starts. The textdata with its new empty lines is the basis of your corpus and your dtm. You get rid of them with these lines:
ui = unique(dtm$i)
dtm.new = dtm[ui,]
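You can see the mismatch directly with a quick check (the exact numbers depend on your data):
nrow(dtm)             # all documents, e.g. 3481
nrow(dtm.new)         # fewer rows after dropping the empty documents
length(tweets$decade) # still the original 3481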
At the same time, you are essentially deleting the empty rows in the dtm and thereby changing the length of the object. This new dtm without the empty rows then becomes the new basis for the topic model. This comes back to bite you when you try to use aggregate() with two objects of different lengths: tweets$decade, which still has the old length of 3481, versus theta, which comes from the topic model, which in turn is based on dtm.new -- the one with fewer rows, remember.
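The error itself is easy to reproduce in isolation; a minimal sketch with made-up numbers:
m <- matrix(1:9, nrow = 3)  # stands in for theta: 3 rows
g <- c("2010", "2010")      # stands in for tweets$decade: only 2 values
aggregate(m, by = list(decade = g), mean)
# Error in aggregate.data.frame(as.data.frame(x), ...) : arguments must have same length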
My suggestion is to first add an ID column to tweets. Later you can use the IDs to find out which texts were deleted by the preprocessing, and to match the lengths of tweets$decade and theta.
I rewrote your code -- try this:
library(readxl)
library(tm)
# Import text data
tweets <- read_xlsx("data.xlsx")
## Include ID for later
tweets$ID <- 1:nrow(tweets)
textdata <- tweets$text
#Load in the library 'stringr' so we can use the str_replace_all function.
library('stringr')
#Remove URLs
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*","")
textdata <- gsub("@\\w+", " ", textdata) # Remove user names (all proper names if you're wise!)
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata) # Remove hashtags
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See:
#rowTotals <- apply(dtm, 1, sum) # Find the sum of words in each document
#dtm.new <- dtm[rowTotals > 0, ]
library("ldatuning")
library("topicmodels")
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
#####################################################
#topics by year
tmResult <- posterior(ldaTopics)
tmResult
theta <- tmResult$topics
dim(theta)
library(ggplot2)
terms(ldaTopics, 7)
# IDs of the documents that survived the preprocessing (row names of dtm.new).
# Convert them to numeric so merge() sorts tweets_new in the same order as the rows of theta.
id <- data.frame(ID = as.numeric(dtm.new$dimnames$Docs))
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
tweets_new <- merge(id, tweets, by = "ID", all.x = TRUE)
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets_new$decade), mean)
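From here the rest of your original plotting code should work unchanged; a quick sanity check plus the remaining steps (assuming the reshape2 and pals packages are installed):
stopifnot(nrow(theta) == nrow(tweets_new)) # lengths now match
top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse = " ")
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames
library(reshape2)
library(pals)
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
ggplot(vizDataFrame, aes(x = decade, y = value, fill = variable)) +
  geom_bar(stat = "identity") + ylab("proportion") +
  scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))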