绘制包含样本大小和功效估计值的图表

Plotting a graph with sample sizes and power estimates

我使用随机生成的身高和体重值模拟了一个线性模型 1000 次,并随机分配每个参与者接受治疗或非治疗(因子 1 和 0)。假设模型是:

lm(bmi~height + weight + treatment, data = df)

我现在正在为以下事情奋斗:

模型现在需要在 300 到 500 之间的样本量之间循环,对于 1000 次重复,每次重复 10 次,并存储 p 值小于 0.05 的模拟实验的比例,以估计可以在 5% 的显着性水平上检测到两个治疗组之间 0.5 的 bmi 变化。

完成上述操作后,我需要创建一个图表,以最好地描述 x 轴上的样本量和 y 轴上的估计功效,并反映实现 80% 的最小样本量不同颜色的功率估计。

知道如何以及从哪里开始吗?

谢谢, 克里斯

我会这样做:

library(dplyr)
library(ggplot2)

# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
  # use assumed data generating process to simulate data and add error
  data <- tibble(height = rnorm(n, 69, 0.1), 
                 weight = rnorm(n, 197.8, 1.9), 
                 treatment = sample(c(0, 1), n, replace = TRUE),
                 error = rnorm(n, sd = 1.75),
                 bmi = 703 * weight / height^2 + 0.5 * treatment + error)

  # model the data
  mdl <- lm(bmi ~ height + weight + treatment, data = data)

  # extract p-value for treatment effect
  summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}

# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
  p_values <- replicate(r, do_simulate(n))
  power <- mean(p_values < alpha)
  return(c(n, power))
}

# third, estimate power at each of your desired 
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ], 
                    power = mx[2, ])

# fourth, make a note of the minimum sample size to achieve your desired power
plot_data %>% 
  filter(power > 0.80) %>% 
  top_n(-1, n) %>% 
  pull(n) -> min_n

# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) + 
  geom_smooth(method = "loess", se = FALSE) + 
  geom_vline(xintercept = min_n)