绘制包含样本大小和功效估计值的图表
Plotting a graph with sample sizes and power estimates
我使用随机生成的身高和体重值模拟了一个线性模型 1000 次,并随机分配每个参与者接受治疗或非治疗(因子 1 和 0)。假设模型是:
lm(bmi~height + weight + treatment, data = df)
我现在正在为以下事情奋斗:
模型现在需要在 300 到 500 之间的样本量之间循环,对于 1000 次重复,每次重复 10 次,并存储 p 值小于 0.05 的模拟实验的比例,以估计可以在 5% 的显着性水平上检测到两个治疗组之间 0.5 的 bmi 变化。
完成上述操作后,我需要创建一个图表,以最好地描述 x 轴上的样本量和 y 轴上的估计功效,并反映实现 80% 的最小样本量不同颜色的功率估计。
知道如何以及从哪里开始吗?
谢谢,
克里斯
我会这样做:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
# use assumed data generating process to simulate data and add error
data <- tibble(height = rnorm(n, 69, 0.1),
weight = rnorm(n, 197.8, 1.9),
treatment = sample(c(0, 1), n, replace = TRUE),
error = rnorm(n, sd = 1.75),
bmi = 703 * weight / height^2 + 0.5 * treatment + error)
# model the data
mdl <- lm(bmi ~ height + weight + treatment, data = data)
# extract p-value for treatment effect
summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
p_values <- replicate(r, do_simulate(n))
power <- mean(p_values < alpha)
return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
power = mx[2, ])
# fourth, make a note of the minimum sample size to achieve your desired power
plot_data %>%
filter(power > 0.80) %>%
top_n(-1, n) %>%
pull(n) -> min_n
# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) +
geom_smooth(method = "loess", se = FALSE) +
geom_vline(xintercept = min_n)
我使用随机生成的身高和体重值模拟了一个线性模型 1000 次,并随机分配每个参与者接受治疗或非治疗(因子 1 和 0)。假设模型是:
lm(bmi~height + weight + treatment, data = df)
我现在正在为以下事情奋斗:
模型现在需要在 300 到 500 之间的样本量之间循环,对于 1000 次重复,每次重复 10 次,并存储 p 值小于 0.05 的模拟实验的比例,以估计可以在 5% 的显着性水平上检测到两个治疗组之间 0.5 的 bmi 变化。
完成上述操作后,我需要创建一个图表,以最好地描述 x 轴上的样本量和 y 轴上的估计功效,并反映实现 80% 的最小样本量不同颜色的功率估计。
知道如何以及从哪里开始吗?
谢谢, 克里斯
我会这样做:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
# use assumed data generating process to simulate data and add error
data <- tibble(height = rnorm(n, 69, 0.1),
weight = rnorm(n, 197.8, 1.9),
treatment = sample(c(0, 1), n, replace = TRUE),
error = rnorm(n, sd = 1.75),
bmi = 703 * weight / height^2 + 0.5 * treatment + error)
# model the data
mdl <- lm(bmi ~ height + weight + treatment, data = data)
# extract p-value for treatment effect
summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
p_values <- replicate(r, do_simulate(n))
power <- mean(p_values < alpha)
return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
power = mx[2, ])
# fourth, make a note of the minimum sample size to achieve your desired power
plot_data %>%
filter(power > 0.80) %>%
top_n(-1, n) %>%
pull(n) -> min_n
# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) +
geom_smooth(method = "loess", se = FALSE) +
geom_vline(xintercept = min_n)