Generating simulated data with a specified intracluster correlation
I have a dataset that looks like this:
d <- structure(list(groupid = c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 3L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), participant_id = c(1L,
1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L,
7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 11L, 11L, 11L,
12L, 12L, 13L, 13L, 13L, 14L, 14L, 14L, 15L, 15L, 15L, 16L, 16L,
17L, 17L, 17L, 18L, 18L, 19L, 19L, 19L, 20L, 20L, 20L, 21L, 21L,
21L, 22L, 22L, 22L, 23L, 23L, 24L, 24L, 24L, 25L, 25L, 26L, 26L,
26L, 27L, 27L, 28L, 28L, 28L, 29L, 29L, 29L, 30L, 30L, 31L, 31L,
31L, 32L, 32L, 32L, 33L, 33L, 34L, 34L, 34L, 35L, 35L, 35L, 36L,
36L, 36L, 37L, 37L, 37L, 38L, 38L, 38L, 39L, 39L, 39L, 40L, 40L,
40L, 41L, 41L, 41L, 42L, 42L, 42L, 43L, 43L, 43L, 44L, 44L, 45L,
45L, 46L, 46L, 47L, 47L, 47L, 48L, 48L, 49L, 49L, 50L, 50L),
attrib1_A = c(0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0), attrib1_B = c(1, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -134L), groups = structure(list(
participant_id = 1:50, .rows = structure(list(1:3, 4:5, 6:8,
9:11, 12:14, 15:17, 18:20, 21:22, 23:25, 26:28, 29:31,
32:33, 34:36, 37:39, 40:42, 43:44, 45:47, 48:49, 50:52,
53:55, 56:58, 59:61, 62:63, 64:66, 67:68, 69:71, 72:73,
74:76, 77:79, 80:81, 82:84, 85:87, 88:89, 90:92, 93:95,
96:98, 99:101, 102:104, 105:107, 108:110, 111:113, 114:116,
117:119, 120:121, 122:123, 124:125, 126:128, 129:130,
131:132, 133:134), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
# Groups:   participant_id [4]
   groupid participant_id attrib1_A attrib1_B
     <int>          <int>     <dbl>     <dbl>
 1       2              1         0         1
 2       2              1         1         0
 3       2              1         0         0
 4       1              2         0         0
 5       1              2         1         0
 6       2              3         1         0
 7       2              3         0         0
 8       2              3         0         1
 9       2              4         0         1
10       2              4         1         0
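where groupid denotes the cluster that each participant_id belongs to. attrib1 and attrib2 are the regressors I want to use in the DGP of the variable I want to create.
I want to generate my binary outcome variable y following the data-generating process (DGP) specified below.
$y = a + \mathrm{factor}(attrib1) + \mathrm{factor}(attrib2)$
where a is a constant: the probability that y = 1 when attrib1 and attrib2 are at their reference categories. The betas are attrib1_A = 0.3 and attrib1_B = -0.5.
Finally, I want the created variable y to follow a specified intracluster correlation (e.g. 0.05).
The intracluster correlation is the between-cluster variability divided by the sum of the within-cluster and between-cluster variability. In our case, the cluster is groupid.
Does anyone know how to generate a variable with a specified DGP and a specified intracluster correlation?
Many thanks for your help.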
OK, so this answer is based on a simple formula for the ICC of a binomial GLMM, in which the within-cluster (residual) variance is fixed at pi^2 / 3.
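For reference, that formula can be written out and inverted to give the random-intercept standard deviation needed for a target ICC. The symbol $\sigma_u$ (the SD of the cluster-level random intercept) is notation introduced here, not taken from the question, and the formula is the latent-scale ICC for a logit link.

```latex
\mathrm{ICC} = \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3}
\quad\Longrightarrow\quad
\sigma_u = \sqrt{\frac{\mathrm{ICC}}{1-\mathrm{ICC}}\cdot\frac{\pi^2}{3}},
\qquad
\mathrm{ICC} = 0.05 \;\Rightarrow\; \sigma_u \approx 0.416
```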
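The first piece of code in your question didn't work for me, and, as you might imagine, convergence is problematic for an ICC of 0.05. But here is the setup for the problem. The answer uses the broom.mixed, arm, performance, and lme4 packages.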
library(broom.mixed)
b1 <- .3 # first coef
b2 <- -.5 # second coef
# choose the SD of the random intercept (second-stage effect) and check the implied ICC
second_stage_var <- 10 # despite the name, this is an SD, not a variance
icc <- second_stage_var^2 /(second_stage_var^2 + ( (pi ^ 2) / 3))
icc
### generate data
id <- rep(letters, 50) # 26 cluster ids, 50 observations each
id_effect <- rnorm(length(letters), 0, second_stage_var) # one effect per id; recycled across rows in the same a..z order as id
x1 <- rnorm(length(id), 10, 10) # first covariate
x2 <- rnorm(length(id), 30, 10) # second covariate
df <- data.frame(id, id_effect, x1, x2)
probs <- arm::invlogit(1 + b1*x1 + b2*x2 + id_effect)
df$y <- rbinom(length(id), 1, prob = probs)
mod1 <- lme4::glmer(data = df, formula = y ~ x1 + x2 + (1|id), family =binomial())
summary(mod1)
## check icc from the performance package
performance::icc(mod1)
## check icc by "hand"
est <- tidy(mod1)
# this assumes row 4 of the tidy output is the random-intercept SD (after the three fixed effects)
total <- est$estimate[[4]]^2 + (pi^2/3)
est$estimate[[4]]^2 / total
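Note that with the SD of 10 used above, the formula gives an ICC of about 0.97 rather than the 0.05 asked for in the question. The following is a minimal base-R sketch of how the same idea might be applied to the question's setup: solve the ICC formula for the random-intercept SD, then draw y with two dummy regressors and groupid as the cluster. The helper name sd_for_icc, the toy design, the sample size, and the intercept value a = 0 are all assumptions made for illustration; they are not part of the original question or answer.

```r
# Sketch only: sd_for_icc and the toy design below are hypothetical, for illustration.
set.seed(1)

# Latent-scale ICC for a logistic random-intercept model:
# icc = sd_u^2 / (sd_u^2 + pi^2 / 3), solved here for sd_u.
sd_for_icc <- function(icc) sqrt(icc / (1 - icc) * pi^2 / 3)

target_icc <- 0.05
sd_u <- sd_for_icc(target_icc)
implied_icc <- sd_u^2 / (sd_u^2 + pi^2 / 3)  # recovers target_icc

# Toy design standing in for the question's data:
# 3 clusters (groupid) and one three-level attribute coded as two dummies.
n <- 300
groupid <- sample(1:3, n, replace = TRUE)
level <- sample(c("ref", "A", "B"), n, replace = TRUE)
attrib1_A <- as.numeric(level == "A")
attrib1_B <- as.numeric(level == "B")

a   <- 0     # intercept on the logit scale (assumed value)
b_A <- 0.3   # coefficient for attrib1_A, from the question
b_B <- -0.5  # coefficient for attrib1_B, from the question

# One random intercept per cluster, then the linear predictor and the outcome.
u   <- rnorm(3, mean = 0, sd = sd_u)
eta <- a + b_A * attrib1_A + b_B * attrib1_B + u[groupid]
y   <- rbinom(n, size = 1, prob = plogis(eta))
```

The coefficients here act on the logit scale, so a is a log-odds rather than a probability. With only three clusters, as in the question's data, the ICC realized in any single simulated dataset can sit far from the target, and a fitted glmer model may struggle to recover it.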