How to keep the text ID of removed text in LDA

I have a data frame like this:

dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)

I use this to clean the text for LDA:

library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
  stopwords("en"),
  stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
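# the ngrams argument of dfm() is deprecated in recent quanteda versions,
# so the 2- and 3-grams are built on the tokens first
myDfm <- toks %>%
    tokens_ngrams(n = 2:3) %>%
    dfm() %>%
    dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")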
dtm <- convert(myDfm, to = "topicmodels")
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

But I noticed that when a text ends up containing nothing, it is removed from the dtm.

gammaDF <- as.data.frame(lda@gamma) 
toptopics <- as.data.frame(cbind(document = row.names(gammaDF), 
                                 topic = apply(gammaDF,1,function(x) names(gammaDF)[which(x==max(x))])))

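However, this gives me a problem when I want to get each topic together with the corresponding id from the first data frame. What can I do to get the correct result?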

id, topic
2    1
3    2
4    1

Before converting to a dtm, you can get the ID of any text that contains 0 words, using apply and which:

library(quanteda)
library(topicmodels)
library(tidyverse)
toks <- tokens(dtext$text)
toks <- tokens_remove(toks, c(
    stopwords("en"),
    stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
toks <- toks %>% tokens_wordstem()
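# the ngrams argument of dfm() is deprecated in recent quanteda versions,
# so the 2- and 3-grams are built on the tokens first
myDfm <- toks %>%
    tokens_ngrams(n = 2:3) %>%
    dfm() %>%
    dfm_trim(min_termfreq = 0.75, termfreq_type = "quantile")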

removed <- which(apply(myDfm, 1, sum) == 0)

Result:

> removed
text1 
    1 
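
To go from these row positions back to the original ids, you can index the id column with them. Here is a minimal sketch in base R. The small matrix counts is a made-up stand-in for myDfm (not real output), and ids plays the role of dtext$id:

```r
# Stand-in for the dfm: rows are documents, columns are features.
# The counts are invented for illustration only.
counts <- matrix(
  c(0, 0, 0,
    1, 1, 1,
    1, 0, 0,
    0, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("text", 1:4), c("f1", "f2", "f3"))
)
ids <- c(1, 2, 3, 4)  # plays the role of dtext$id

# TRUE for documents that have no features left
empty <- rowSums(counts) == 0

removed <- which(empty)    # same idea as which(apply(myDfm, 1, sum) == 0)
removed_ids <- ids[empty]  # ids that will be missing from the dtm
kept_ids <- ids[!empty]    # ids in the same order as the remaining rows
```

Indexing with the logical vector rather than with -removed also behaves correctly when no document is empty, since a negative index of length zero would select nothing.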

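The problem here is that LDA() drops the row names from the document-term matrix and replaces them with simple serial numbers. These no longer correspond to your original dtext$id. But you can replace the LDA ids with the document names, and then link these back to your input texts.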

To make this clearer, we will first replace your dtext$id with something that can be more clearly distinguished from the serial numbers that LDA() returns.

# to distinguish your id from those from LDA()
dtext$id <- paste0("doc_", dtext$id)

# this takes the document name from "id"
toks <- corpus(dtext, docid_field = "id") %>%
  tokens()

Then run your other steps exactly as above.

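We can see that the first document is empty (it has zero feature counts). This is the one that is dropped when the dfm is converted to the "topicmodels" format.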

ntoken(myDfm)
## text1 text2 text3 text4 
##     0    49    63   201

as.matrix(dtm[, 1:3])
##        Terms
## Docs    dataset_contain contain_movi movi_review
##   text2               1            1           1
##   text3               1            0           0
##   text4               0            0           0

However, these document names have been dropped by LDA().

toptopics
##   document topic
## 1        1    V2
## 2        2    V2
## 3        3    V1

But we can (re)assign them from the row names of the dtm, which will correspond 1:1 to the documents returned by LDA().

toptopics$docname <- rownames(dtm)
toptopics
##   document topic docname
## 1        1    V2   text2
## 2        2    V2   text3
## 3        3    V1   text4

Now toptopics$docname can be merged with dtext$id, which solves your problem.
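
The merge itself could look like the sketch below. It assumes that the document names take the doc_N form of the renamed ids (as they would if the corpus was built with docid_field = "id"), whereas the printed output above shows text2, text3, text4. The toptopics data frame here is a hand-built stand-in using the topic values shown above, not a fresh LDA result:

```r
# Input ids, renamed as above so they can match the document names
dtext <- data.frame(id = paste0("doc_", 1:4), stringsAsFactors = FALSE)

# Hypothetical stand-in for toptopics after docname has been filled in
# from rownames(dtm); doc_1 is absent because it was empty
toptopics <- data.frame(
  document = c("1", "2", "3"),
  topic = c("V2", "V2", "V1"),
  docname = c("doc_2", "doc_3", "doc_4"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps doc_1 in the result, with NA as its topic
result <- merge(dtext, toptopics[, c("docname", "topic")],
                by.x = "id", by.y = "docname", all.x = TRUE)
result
```

Using all.x = TRUE is a design choice: the removed text stays visible with a missing topic instead of silently disappearing. Leave it out if you only want the documents that LDA() actually saw.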