Inefficient sampling of simple binomial GLM using rstan MCMC
I am trying to fit a binomial GLM using the rethinking package (which draws on rstan's MCMC). The model fits, but the sampling is inefficient and Rhat indicates that something went wrong. I do not understand the cause of this fitting problem.
Here is the data:
d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
Here is the model:
m1_stan <- map2stan(
  alist(
    afd_votes ~ dbinom(votes_total, p),
    logit(p) <- beta0 + beta1*foreigner_n,
    beta0 ~ dnorm(0, 10),
    beta1 ~ dnorm(0, 10)
  ),
  data = d,
  WAIC = FALSE,
  iter = 1000)
The fit diagnostics (Rhat, effective sample size) signal problems:
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
beta0 -3.75 0 -3.75 -3.75 3 2.21
beta1 0.00 0 0.00 0.00 10 1.25
The traceplot does not show the "fat hairy caterpillar":
The Stan output suggested increasing two parameters, adapt_delta and max_treedepth, which I did. This improved the sampling process to some extent:
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
beta0 18.1 0.09 18.11 18.16 28 1.06
beta1 0.0 0.00 0.00 0.00 28 1.06
But as the traceplot shows, something is still wrong:
The pairs plot looks weird, too:
What else I have tried:
- I centered/z-standardized the predictor (which produced this error: "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed.")
- I tried a normal model (but it is count data)
- I checked for missing values (there are none)
- I increased the number of iterations to 4000, with no improvement
- I increased the sd of the priors (then the model takes very long to fit)
But nothing has helped so far. What could be the reason for the poor fit? What could I try? Could the large counts in each cell be a problem?
mean(d_short$afd_votes)
[1] 19655.83
A data excerpt:
head(d)
afd_votes votes_total foreigner_n
1 11647 170396 16100
2 9023 138075 12600
3 11176 130875 11000
4 11578 156268 9299
5 10390 150173 25099
6 11161 130514 13000
Session info:
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] viridis_0.5.1 viridisLite_0.3.0 sjmisc_2.7.3 pradadata_0.1.3 rethinking_1.59 rstan_2.17.3 StanHeaders_2.17.2 forcats_0.3.0 stringr_1.3.1
[10] dplyr_0.7.6 purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] httr_1.3.1 jsonlite_1.5 modelr_0.1.2 assertthat_0.2.0 stats4_3.5.0 cellranger_1.1.0 yaml_2.1.19 pillar_1.3.0 backports_1.1.2
[10] lattice_0.20-35 glue_1.3.0 digest_0.6.15 rvest_0.3.2 snakecase_0.9.1 colorspace_1.3-2 htmltools_0.3.6 plyr_1.8.4 pkgconfig_2.0.1
[19] broom_0.5.0 haven_1.1.2 bookdown_0.7 mvtnorm_1.0-8 scales_0.5.0 stringdist_0.9.5.1 sjlabelled_1.0.12 withr_2.1.2 RcppTOML_0.1.3
[28] lazyeval_0.2.1 cli_1.0.0 magrittr_1.5 crayon_1.3.4 readxl_1.1.0 evaluate_0.11 nlme_3.1-137 MASS_7.3-50 xml2_1.2.0
[37] blogdown_0.8 tools_3.5.0 loo_2.0.0 data.table_1.11.4 hms_0.4.2 matrixStats_0.54.0 munsell_0.5.0 prediction_0.3.6 bindrcpp_0.2.2
[46] compiler_3.5.0 rlang_0.2.1 grid_3.5.0 rstudioapi_0.7 labeling_0.3 rmarkdown_1.10 gtable_0.2.0 codetools_0.2-15 inline_0.3.15
[55] curl_3.2 R6_2.2.2 gridExtra_2.3 lubridate_1.7.4 knitr_1.20 bindr_0.1.1 rprojroot_1.3-2 KernSmooth_2.23-15 stringi_1.2.4
[64] Rcpp_0.12.18 tidyselect_0.2.4 xfun_0.3 coda_0.19-1
Stan does better with unit-scale, uncorrelated parameters. From the Stan manual, §28.4 Model Conditioning and Curvature:
Ideally, all parameters should be programmed so that they have unit
scale and so that posterior correlation is reduced; together, these
properties mean that there is no rotation or scaling required for
optimal performance of Stan’s algorithms. For Hamiltonian Monte Carlo,
this implies a unit mass matrix, which requires no adaptation as it is
where the algorithm initializes. Riemannian Hamiltonian Monte Carlo
performs this conditioning on the fly at every step, but such
conditioning is very expensive computationally.
In your case, beta1 is tied to foreigner_n, which is not unit scale and is therefore unbalanced relative to beta0. Moreover, since foreigner_n is not centered, both betas shift the location of p during sampling, so there is posterior correlation between them.
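To make the curvature point concrete, here is a small sketch of my own (not from the original post): it uses only the six rows from the question's data excerpt and an illustrative coefficient vector, and compares the condition number of X'WX, the negative Hessian of the binomial logit log-likelihood, for the raw versus the standardized predictor.

```python
import numpy as np

# Six rows from the question's data excerpt (an illustrative subset,
# not the full data; beta below is an illustrative point, not a fit).
n = np.array([170396, 138075, 130875, 156268, 150173, 130514])        # votes_total
x = np.array([16100, 12600, 11000, 9299, 25099, 13000], dtype=float)  # foreigner_n

def cond_number(x, beta=(-2.0, 0.0)):
    """Condition number of X'WX (the negative Hessian of the binomial
    logit log-likelihood) with GLM weights W = diag(n * p * (1 - p))."""
    X = np.column_stack([np.ones_like(x), x])
    p = 1.0 / (1.0 + np.exp(-(X @ np.asarray(beta))))
    W = n * p * (1.0 - p)
    return np.linalg.cond(X.T @ (W[:, None] * X))

x_z = (x - x.mean()) / x.std()
print(f"raw predictor:          cond = {cond_number(x):.2e}")
print(f"standardized predictor: cond = {cond_number(x_z):.2e}")
```

With the raw predictor the Hessian is ill-conditioned by many orders of magnitude, so a unit (diagonal) mass matrix is a poor match and NUTS needs tiny steps and deep trees; after standardization the posterior geometry is nearly unit scale.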
Standardizing yields a much more tractable model. Transforming foreigner_n to be centered and unit scale lets the model converge quickly and produce high effective sample sizes. I would also argue that the betas are easier to interpret in this form: beta0 is concerned only with the location of p, while beta1 is concerned only with how variation in foreigner_n explains variation in afd_votes/votes_total.
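As an aside of mine (not part of the original answer): standardizing is just a reparameterization of the linear predictor. Writing μ and σ for the sample mean and sd of foreigner_n,

```latex
\operatorname{logit}(p) \;=\; \beta_0 + \beta_1 x
\;=\; \underbrace{\left(\beta_0 + \beta_1\mu\right)}_{\beta_0^{(z)}}
\;+\; \underbrace{\beta_1\sigma}_{\beta_1^{(z)}}\, z,
\qquad z = \frac{x - \mu}{\sigma}.
```

So the raw-scale slope is β1 = β1^(z)/σ; with σ roughly in the thousands here, a standardized slope of −0.24 corresponds to a raw-scale slope on the order of 10⁻⁵, which is why the original fits print beta1 as 0.00 at two decimals.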
library(readr)
library(rethinking)

d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
d$foreigner_z <- scale(d$foreigner_n)

m1 <- alist(
  afd_votes ~ dbinom(votes_total, p),
  logit(p) <- beta0 + beta1*foreigner_z,
  c(beta0, beta1) ~ dnorm(0, 1)
)

m1_stan <- map2stan(m1, data = d, WAIC = FALSE,
                    iter = 10000, chains = 4, cores = 4)
Inspecting the sampling, we have
> summary(m1_stan)
Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -1.95 0.00 0.00 -1.95 -1.95 -1.95 -1.95 -1.95 16352 1
beta1 -0.24 0.00 0.00 -0.24 -0.24 -0.24 -0.24 -0.24 13456 1
dev 861952.93 0.02 1.97 861950.98 861951.50 861952.32 861953.73 861958.26 9348 1
lp__ -17523871.11 0.01 0.99 -17523873.77 -17523871.51 -17523870.80 -17523870.39 -17523870.13 9348 1
Samples were drawn using NUTS(diag_e) at Sat Sep 1 11:48:55 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Looking at the pairs plot, we see that the correlation between the betas has dropped to 0.15:
Supplementary analysis
My initial intuition was that the uncentered foreigner_n was the main problem. At the same time, I was a bit puzzled, because Stan uses HMC/NUTS, which I had believed to be fairly robust to correlated latent variables. However, I noticed the Stan manual's remarks on practical problems with scale invariance due to numerical instability, which Michael Betancourt also commented on in a CrossValidated answer (though it is a rather old post). So I wanted to test whether centering or scaling mattered most for improving the sampling.
Centering alone
Centering alone still resulted in very poor performance. Note that the effective sample size is literally one effective sample per chain.
library(readr)
library(rethinking)

d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
d$foreigner_c <- d$foreigner_n - mean(d$foreigner_n)

m2 <- alist(
  afd_votes ~ dbinom(votes_total, p),
  logit(p) <- beta0 + beta1*foreigner_c,
  c(beta0, beta1) ~ dnorm(0, 1)
)

m2_stan <- map2stan(m2, data = d, WAIC = FALSE,
                    iter = 10000, chains = 4, cores = 4)
which yields
Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -0.64 0.4 0.75 -1.95 -1.29 -0.54 0.2 0.42 4 2.34
beta1 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.00 4 2.35
dev 18311608.99 8859262.1 17270228.21 861951.86 3379501.84 14661443.24 37563992.4 46468786.08 4 1.75
lp__ -26248697.70 4429630.9 8635113.76 -40327285.85 -35874888.93 -24423614.49 -18782644.5 -17523870.54 4 1.75
Samples were drawn using NUTS(diag_e) at Sun Sep 2 18:59:52 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
and a seemingly still problematic pairs plot:
Scaling alone
Scaling alone improved the sampling considerably! The resulting posterior still shows fairly high correlation, but the effective sample sizes are in an acceptable range, though well below those of full standardization.
library(readr)
library(rethinking)

d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)
d$foreigner_s <- d$foreigner_n / sd(d$foreigner_n)

m3 <- alist(
  afd_votes ~ dbinom(votes_total, p),
  logit(p) <- beta0 + beta1*foreigner_s,
  c(beta0, beta1) ~ dnorm(0, 1)
)

m3_stan <- map2stan(m3, data = d, WAIC = FALSE,
                    iter = 10000, chains = 4, cores = 4)
yielding
Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
4 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=20000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 -1.58 0.00 0.00 -1.58 -1.58 -1.58 -1.58 -1.57 5147 1
beta1 -0.24 0.00 0.00 -0.24 -0.24 -0.24 -0.24 -0.24 5615 1
dev 861952.93 0.03 2.01 861950.98 861951.50 861952.31 861953.69 861958.31 5593 1
lp__ -17523870.45 0.01 1.00 -17523873.15 -17523870.83 -17523870.14 -17523869.74 -17523869.48 5591 1
Samples were drawn using NUTS(diag_e) at Sun Sep 2 19:02:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The pairs plot shows that notable correlation remains:
So while decorrelating the variables does improve sampling, removing the scale imbalance mattered most in this model.
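That asymmetry between centering and scaling also shows up in the curvature of the likelihood itself. As a rough check of my own (using only the six rows from the question's data excerpt and an illustrative coefficient vector, not the full analysis), compare the condition number of X'WX, the negative Hessian of the binomial logit log-likelihood, across all four predictor variants:

```python
import numpy as np

# Six rows from the question's data excerpt (illustrative subset only).
n = np.array([170396, 138075, 130875, 156268, 150173, 130514])        # votes_total
x = np.array([16100, 12600, 11000, 9299, 25099, 13000], dtype=float)  # foreigner_n

def cond_number(x, beta=(-2.0, 0.0)):
    """Condition number of X'WX (the negative Hessian of the binomial
    logit log-likelihood) with GLM weights W = diag(n * p * (1 - p)),
    evaluated at an illustrative beta."""
    X = np.column_stack([np.ones_like(x), x])
    p = 1.0 / (1.0 + np.exp(-(X @ np.asarray(beta))))
    W = n * p * (1.0 - p)
    return np.linalg.cond(X.T @ (W[:, None] * X))

variants = {
    "raw":          x,
    "centered":     x - x.mean(),
    "scaled":       x / x.std(),
    "standardized": (x - x.mean()) / x.std(),
}
for name, v in variants.items():
    print(f"{name:12s} cond = {cond_number(v):.2e}")
```

On this subset, centering alone leaves the Hessian almost as ill-conditioned as the raw predictor, whereas scaling alone already brings the condition number down to a modest value, mirroring the sampling results above.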