Using `BTM` with `predict` in R is outputting uniform topic probabilities of 0.1

I have a 3000 x 2 corpus data frame named dfcorpus, consisting of two columns: the document id and the text (lowercased and preprocessed). I use the biterm topic model package BTM in R as follows:

> library(BTM)

> model = BTM(data = dfcorpus,
              k = 10,
              detailed = TRUE,
              trace = TRUE)

> model
Biterm Topic Model
  trained with 1000 Gibbs iterations, alpha: 5, beta: 0.01
  topics: 10
  size of the token vocabulary: 29768
  topic distribution theta: 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

As you can see, the topic distribution is a flat row of 0.1s. This also cascades into the inference output for all documents:

> head(predict(model, newdata = dfcorpus))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1   0.1
2  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1   0.1
3  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1   0.1
4  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1   0.1
5  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1   0.1
6  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1   0.1

One might assume that theta comes out as 0.1s because of the data. However, I ran the original C++ implementation instead of the R wrapper and got different, more varied probabilities. I also followed the examples in the BTM documentation and was able to obtain varied probabilities there as well. This hints that the problem lies in my R implementation, so I would like to know what is wrong with it.
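
For reference, this is how my input is shaped (a minimal sketch; the column names follow the description above):

## Shape of the input passed to BTM (column names as described above)
nrow(dfcorpus)                   # 3000 -- one row per document
length(unique(dfcorpus$doc_id))  # 3000 distinct document ids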


EDIT

The data I was using when I observed the strange behaviour is the simulated data provided here.

As for the example I followed and obtained varied probabilities with, I list it below (source):

> library(udpipe)
> library(BTM)
> data("brussels_reviews_anno", package = "udpipe")
> x <- subset(brussels_reviews_anno, language == "nl")
> x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
> x <- x[, c("doc_id", "lemma")]
> model  <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
2020-07-29 11:41:04 Start Gibbs sampling iteration 1/10
2020-07-29 11:41:04 Start Gibbs sampling iteration 2/10
2020-07-29 11:41:04 Start Gibbs sampling iteration 3/10
2020-07-29 11:41:04 Start Gibbs sampling iteration 4/10
2020-07-29 11:41:05 Start Gibbs sampling iteration 5/10
2020-07-29 11:41:05 Start Gibbs sampling iteration 6/10
2020-07-29 11:41:05 Start Gibbs sampling iteration 7/10
2020-07-29 11:41:05 Start Gibbs sampling iteration 8/10
2020-07-29 11:41:05 Start Gibbs sampling iteration 9/10
2020-07-29 11:41:05 Start Gibbs sampling iteration 10/10
> model
Biterm Topic Model
  trained with 10 Gibbs iterations, alpha: 1, beta: 0.01
  topics: 5
  size of the token vocabulary: 1667
  topic distribution theta: 0.202 0.208 0.221 0.189 0.179

Note the topic distribution probabilities. Now infer $p(z|d)$:

> head(predict(model, newdata = x))
               [,1]       [,2]       [,3]       [,4]       [,5]
10185723 0.18745036 0.23528461 0.17010275 0.18650434 0.22065795
10284782 0.03442299 0.22308442 0.37919379 0.18913706 0.17416174
10597787 0.11599689 0.13501897 0.09783666 0.58588106 0.06526642
10789408 0.14422460 0.45259918 0.16515674 0.10697039 0.13104909
10809161 0.15523735 0.06510719 0.48236257 0.26374167 0.03355122
10913343 0.06599192 0.15961549 0.46651229 0.06327143 0.24460886

Your dfcorpus should be a tokenised data.frame, as shown in the help of the BTM package at https://cran.r-project.org/web/packages/BTM/BTM.pdf:

## Get your data - 1 row per document
x <- readLines("https://raw.githubusercontent.com/xiaohuiyan/BTM/master/sample-data/doc_info.txt")
x <- data.frame(doc_id = 1:length(x), text = x, stringsAsFactors = FALSE)
nrow(x)
#> [1] 30000
## Tokenise text - split by space to get 1 row per token - see help of the BTM function: ?BTM
library(udpipe)
dfcorpus <- strsplit.data.frame(x, term = "text", group = "doc_id", split = " ")
nrow(dfcorpus)
#> [1] 148245
head(dfcorpus, n = 10)
#>    doc_id      text
#> 1       1      ipod
#> 2       1    tumblr
#> 3       1       app
#> 4       1   working
#> 5       1      fine
#> 6       2       yup
#> 7       2    longer
#> 8       3    jungle
#> 9       3 abduction
#> 10      3     alive
## Build the model - using iter 20 to quickly show something but you should normally take 1000/2000 Gibbs iterations
library(BTM)
model <- BTM(data = dfcorpus, k = 10, detailed = TRUE, iter = 20, trace = 10)
#> 2020-07-29 20:20:08 Start Gibbs sampling iteration 1/20
#> 2020-07-29 20:20:16 Start Gibbs sampling iteration 11/20
model
#> Biterm Topic Model
#>   trained with 20 Gibbs iterations, alpha: 5, beta: 0.01
#>   topics: 10
#>   size of the token vocabulary: 28634
#>   topic distribution theta: 0.075 0.11 0.116 0.13 0.13 0.08 0.087 0.089 0.075 0.107
## Use predict and show some
scores <- predict(model, newdata = dfcorpus)
head(scores, n = 5)
#>           [,1]        [,2]         [,3]         [,4]         [,5]         [,6]
#> 1 2.682796e-02 0.016116481 8.207263e-02 4.334003e-01 2.908850e-01 1.805332e-02
#> 2 3.566287e-05 0.191604268 9.500306e-05 4.715159e-01 2.077098e-08 3.379564e-08
#> 3 1.312881e-01 0.003985054 1.152026e-02 7.568868e-03 4.480935e-01 1.140198e-04
#> 4 1.315275e-07 0.029871326 5.941080e-06 7.688591e-08 7.668167e-08 2.859951e-05
#> 5 5.525665e-02 0.001815486 3.407789e-03 6.540897e-01 5.049276e-03 2.669286e-03
#>           [,7]         [,8]         [,9]      [,10]
#> 1 2.829724e-06 2.375107e-03 9.312936e-02 0.03713697
#> 2 3.089975e-08 9.064206e-06 1.156608e-01 0.22107922
#> 3 8.330137e-04 3.289586e-01 3.613660e-03 0.06402498
#> 4 2.354538e-05 9.512406e-01 1.320487e-07 0.01882959
#> 5 7.635004e-07 6.455946e-02 1.450137e-02 0.19865020
## Plot the model
library(textplot)
library(ggraph)
#> Loading required package: ggplot2
library(concaveman)
plot(model)

Created on 2020-07-29 by the reprex package (v0.3.0)

You should probably also restrict the words kept in the model by frequency, or select only certain parts-of-speech tags (e.g. using the udpipe R package), so that the model is limited to the most meaningful words and produces sensible topics.
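
For instance, a filter along these lines, reusing the x data.frame with doc_id/text columns built above; the English udpipe model and the frequency cutoff of 5 are assumptions for illustration, not fixed recommendations:

library(udpipe)
library(BTM)
## Annotate the raw documents (downloads the model file once; English assumed)
udmodel <- udpipe_download_model(language = "english")
anno    <- udpipe(x, object = udmodel$file_model)
## Keep only nouns and proper nouns -- usually the most topic-bearing words
anno    <- subset(anno, upos %in% c("NOUN", "PROPN"))
## Drop rare lemmas; the cutoff of 5 occurrences is arbitrary
freqs   <- txt_freq(anno$lemma)
anno    <- subset(anno, lemma %in% freqs$key[freqs$freq >= 5])
## Refit the model on the filtered tokens
dfcorpus <- anno[, c("doc_id", "lemma")]
model    <- BTM(data = dfcorpus, k = 10, iter = 1000, detailed = TRUE)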