Restore original document IDs from an LDA object
I am trying to work with LDA output from the topicmodels package. While it is easy to extract the most likely predicted topic for each document by using group_by() over document and top_n() on gamma, the unique document IDs are suppressed in the "beta" estimates: that output contains only three columns (topic, term, beta). This makes it impossible to obtain a "consensus" topic prediction from the terms of a given document (beta).
An example with my own data:
Sys.setlocale("LC_ALL","Chinese") # reset to simplified Chinese encoding as the text data is in Chinese
library(foreign)
library(dplyr)
library(plyr)
library(tidyverse)
library(tidytext)
library(tm)
library(topicmodels)
sample_dtm <- readRDS(gzcon(url("https://www.dropbox.com/s/gznqlncd9psx3wz/sample_dtm.rds?dl=1")))
lda_out <- LDA(sample_dtm, k = 2, control = list(seed = 1234))
word_topics <- tidy(lda_out, matrix = "beta")
head(word_topics, n = 4)
# A tibble: 4 x 3
topic term beta
<int> <chr> <dbl>
1 1 费解 8.49e- 4
2 2 费解 1.15e- 9
3 1 上 2.92e- 3
document_gamma <- tidy(lda_out, matrix = "gamma")
head(document_gamma, n = 4)
# A tibble: 4 x 3
document topic gamma
<chr> <int> <dbl>
1 1203232 1 0.00374
2 529660 1 0.0329
3 738921 1 0.00138
4 963374 1 0.302
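The group_by()/top_n() step described above can be sketched as follows. This is a minimal illustration on toy data in the same three-column shape as document_gamma, not the original data; dplyr is assumed to be attached:

```r
library(dplyr)

# Toy table in the shape of tidy(lda_out, matrix = "gamma")
document_gamma <- tibble(
  document = c("649", "649", "656", "656"),
  topic    = c(1L, 2L, 1L, 2L),
  gamma    = c(0.9, 0.1, 0.3, 0.7)
)

# Most likely topic per document: keep, for each document,
# the row with the highest gamma (document-topic probability)
doc_top_topic <- document_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()
doc_top_topic
```

Note that top_n() keeps ties, so a document whose gamma values are exactly equal across topics will appear more than once.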
Is there any way to recover the document IDs from the lda output and combine them with the beta estimates (word_topics, which is stored as a data.frame object)? That would make it easier to compare the consensus topic estimates from beta and gamma.
If I understand you correctly, I believe the function you want is augment(), which returns a table with one row per original document-term pair, joined to the topic assignments.
Sys.setlocale("LC_ALL","Chinese") # reset to simplified Chinese encoding as the text data is in Chinese
#> Warning in Sys.setlocale("LC_ALL", "Chinese"): OS reports request to set
#> locale to "Chinese" cannot be honored
#> [1] ""
library(foreign)
library(dplyr)
library(plyr)
#> -------------------------------------------------------------------------
#> You have loaded plyr after dplyr - this is likely to cause problems.
#> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
#> library(plyr); library(dplyr)
#> -------------------------------------------------------------------------
#>
#> Attaching package: 'plyr'
#> The following objects are masked from 'package:dplyr':
#>
#> arrange, count, desc, failwith, id, mutate, rename, summarise,
#> summarize
library(tidyverse)
library(tidytext)
library(tm)
library(topicmodels)
sample_dtm <- readRDS(gzcon(url("https://www.dropbox.com/s/gznqlncd9psx3wz/sample_dtm.rds?dl=1")))
lda_out <- LDA(sample_dtm, k = 2, control = list(seed = 1234))
augment(lda_out, sample_dtm)
#> # A tibble: 18,676 x 4
#> document term count .topic
#> <chr> <chr> <dbl> <dbl>
#> 1 649 作揖 1 1
#> 2 649 拳头 1 1
#> 3 649 赞 1 1
#> 4 656 住 1 1
#> 5 656 小区 1 1
#> 6 656 没 1 1
#> 7 656 注意 2 1
#> 8 1916 中国 1 1
#> 9 1916 中国台湾 1 1
#> 10 1916 反对 1 1
#> # … with 18,666 more rows
Created on 2019-06-04 by the reprex package (v0.2.1)
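One way to turn the augment() output into the per-document "consensus" topic you asked about is to weight each term's topic assignment by its count and take the modal topic. A minimal sketch on toy data in the same shape as the augment() result (dplyr::count is written with an explicit prefix because plyr masks count in the session above):

```r
library(dplyr)

# Toy table in the shape returned by augment(lda_out, sample_dtm)
assignments <- tibble(
  document = c("649", "649", "656", "656", "656"),
  term     = c("a", "b", "c", "d", "e"),
  count    = c(2, 1, 1, 2, 1),
  .topic   = c(1, 2, 1, 2, 2)
)

# Count-weighted modal topic per document: sum the term counts
# assigned to each topic, then keep the topic with the largest total
consensus <- assignments %>%
  dplyr::count(document, .topic, wt = count) %>%
  group_by(document) %>%
  top_n(1, n) %>%
  ungroup()
consensus
```

The resulting table has one row per document (more if topics tie), which can then be joined against the tidied gamma matrix by document to compare the two notions of a document's topic.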
This joins the document IDs from the LDA model to the topics. It sounds like you already understand this, but just to reiterate:
the beta matrix contains the word-topic probabilities
the gamma matrix contains the document-topic probabilities
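To illustrate the beta (word-topic) side, the highest-probability terms per topic can be pulled from the tidied beta matrix with the same group_by()/top_n() pattern. A sketch on toy data in the same three-column shape (topic, term, beta); the prefixed dplyr:: calls guard against the plyr masking shown above:

```r
library(dplyr)

# Toy table in the shape of tidy(lda_out, matrix = "beta")
word_topics <- tibble(
  topic = c(1L, 1L, 1L, 2L, 2L, 2L),
  term  = c("a", "b", "c", "a", "b", "c"),
  beta  = c(0.5, 0.3, 0.2, 0.1, 0.2, 0.7)
)

# Top 2 terms per topic by word-topic probability
top_terms <- word_topics %>%
  group_by(topic) %>%
  top_n(2, beta) %>%
  ungroup() %>%
  dplyr::arrange(topic, dplyr::desc(beta))
top_terms
```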