Problems running stm for topic modelling with a single covariate
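I am trying to run an LDA topic modelling analysis using stm, but I have a problem with my metadata. Everything seems to work, except that my one covariate (Age) is not being read.
I have some tweets (the docu column in the Excel file) with an Age covariate that takes the values young and old.
Here is my data:
http://www.mediafire.com/file/5eb9qe6gbg22o9i/dada.xlsx/file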
library(stm)
library(readxl)
library(quanteda)
library(stringr)
library(tm)
data <- read_xlsx("C:/dada.xlsx")
# Remove URLs
data$docu <- str_replace_all(data$docu, "https://t.co/[A-Za-z0-9]*", "")
data$docu <- gsub("@\\w+", " ", data$docu) # Remove user names (all proper names if you're wise!)
data$docu <- iconv(data$docu, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
data$docu <- gsub("#\\w+", " ", data$docu) # Remove hashtags
data$docu <- gsub("http.+ |http.+$", " ", data$docu) # Remove links
data$docu <- gsub("[[:punct:]]", " ", data$docu) # Remove punctuation
data$docu <- gsub("[\r\n]", "", data$docu) # Remove line breaks
data$docu <- tolower(data$docu)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
data$docu <- tm::removeWords(x = data$docu, c(stopwords(kind = "SMART")))
data$docu <- gsub(" +", " ", data$docu) # General spaces (should just do all whitespaces no?)
myCorpus <- corpus(data$docu)
docvars(myCorpus, "Age") <- as.factor(data$Age)
processed <- textProcessor(data$docu, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)
out$documents
out$meta
levels(out$meta)
First_STM <- stm(documents = out$documents, vocab = out$vocab,
                 K = 4, prevalence = ~ Age,
                 max.em.its = 25, data = out$meta,
                 init.type = "LDA", verbose = FALSE)
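As the code shows, I tried to define Age as a factor. I thought that would not be necessary, since running textProcessor might be enough. However, when I run levels(out$meta) I get NULL, and when I then run stm to get the actual topics I get a memory allocation error.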
You set the Age metadata variable to a factor in this line:
docvars(myCorpus, "Age") <- as.factor(data$Age)
But you never use myCorpus after that. In the following steps you preprocess the data frame data instead. Try defining Age as a factor in the data frame itself:
data$Age <- factor(data$Age)
Then run that line just before these steps:
processed <- textProcessor(data$docu, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 2)
You can then inspect the levels like this:
levels(out$meta$Age)
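To see why levels(out$meta) returned NULL in the question, here is a minimal base R sketch. The data frame meta and its values are made up for illustration and only stand in for out$meta. levels() reads the levels attribute of a single factor, which a data frame as a whole does not carry.

```r
# Toy stand-in for out$meta (hypothetical values, not the asker's data)
meta <- data.frame(
  docu = c("first tweet", "second tweet", "third tweet"),
  Age  = c("young", "old", "young"),
  stringsAsFactors = FALSE
)

# A data frame has no "levels" attribute, so this returns NULL
levels(meta)

# A character column has no levels either
levels(meta$Age)

# After converting the column to a factor, levels() lists the categories
meta$Age <- factor(meta$Age)
levels(meta$Age)
```

So the call has to target the column (out$meta$Age), and that column has to be a factor before it goes into textProcessor.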
I could not reproduce your memory allocation error, though. stm runs fine on my machine (Win 10 Pro, 8 GB RAM).
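Since the error could not be reproduced, a quick sanity check on the covariate before calling stm might help narrow things down. The sketch below is only an illustration: check_covariate is a hypothetical helper (not part of stm), and docs and meta are toy objects standing in for out$documents and out$meta. It assumes the metadata should have one row per document, which is my understanding of what prepDocuments returns.

```r
# Hypothetical helper (not an stm function): summarises a prevalence covariate
check_covariate <- function(documents, meta, name) {
  stopifnot(name %in% names(meta))
  x <- meta[[name]]
  list(
    rows_match = nrow(meta) == length(documents), # one metadata row per document?
    is_factor  = is.factor(x),                    # categorical covariate stored as factor?
    n_missing  = sum(is.na(x)),                   # missing covariate values
    counts     = table(x, useNA = "ifany")        # documents per category
  )
}

# Toy stand-ins for out$documents and out$meta (made-up values)
docs <- list(matrix(1L, 2, 1), matrix(1L, 2, 2), matrix(1L, 2, 1))
meta <- data.frame(Age = factor(c("young", "old", "young")))

res <- check_covariate(docs, meta, "Age")
str(res)
```

With the real objects the call would be check_covariate(out$documents, out$meta, "Age"). A mismatch in row counts or any missing Age values would be worth fixing before fitting the model.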