How to Determine the Confidence Interval for a WEIGHTED Population Proportion with R

Task: calculate the weighted shares of the manifestations of a categorical variable and find the confidence intervals for these weighted shares.

library(dplyr)
set.seed(100)

Create a data set with a categorical variable and a weight variable:

df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

A quick look at the data:

head(df, 10)
tail(df, 10)

Now I can complete the task without taking the weights into account:

Write a function that returns the unweighted share of one manifestation of the categorical variable and its 95% confidence interval (for general information on determining the confidence interval for a population proportion see, e.g.: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/):

ci.share <- function(category, manifestation){
  n = length(category)
  share = length(which(category == manifestation)) / n  # unweighted share of the manifestation
  se = sqrt(share*(1-share) / n)                        # standard error of the estimated proportion

  # Use the normal approximation only if the rule of thumb n*p*(1-p) >= 9 is satisfied
  if(n*share*(1-share) >= 9){
    U <- share - 1.96*se
    O <- share + 1.96*se
    KI <- c(U, share, O)
    names(KI) <- c("lower boundary", "share", "upper boundary")
    KI = KI * 100  # express as percentages
    return(KI)
  } else {return("Error, because n*p*(1-p) < 9")}
}

Apply the function to all manifestations and store the results in a list:

cis <- list()
for(i in c("A", "B", "C", "D")){
  cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}

Make the results easy to read:

cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis  #TASK DONE

The question is: how do I get the equivalent of "cis" when taking the weights into account?

What I can do is find the weighted shares:

ws <- summarise(group_by(df, Category), Weight = sum(Weight))
ws[,2] <- (ws[,2]/sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws
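
(As a purely illustrative cross-check, the same weighted shares can also be obtained with base R:)

# Base-R cross-check of the weighted shares (should match ws$Weighted_Share)
tapply(df$Weight, df$Category, sum) / sum(df$Weight) * 100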

But how do I get the confidence intervals now?

I would be very grateful for a solution. Thanks in advance! Andy

I finally found a very simple solution to my problem. The package "survey" does exactly what I was looking for:

set.seed(100)
library(survey)

df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

d <- svydesign(ids = ~1, weights = ~Weight, data = df)
svymean(~Category, d)

The code produces the weighted proportions (the same results as in "ws") and the corresponding standard errors, which makes it easy to calculate the confidence intervals.
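
For example, the 95% confidence intervals can be read off directly by applying confint() to the svymean() result (a minimal sketch, reusing the design object d from above):

# Sketch: Wald-type 95% CIs from the survey standard errors
m <- svymean(~Category, d)
cbind(p = coef(m), confint(m))   # weighted proportions alongside their CI bounds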

For comparison with the survey-based calculation of the confidence intervals (CIs) for the proportions, I post here the code that computes the CIs using the Central Limit Theorem (CLT) for non-identically distributed variables (as requested by the original poster (OP) in their comments):

1) Definition of auxiliary functions

#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
  design.mean = svymean(~Category, design, deff="replace")
  CI = confint( design.mean )
  # Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
  CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )

  return(as.data.frame(CI))
}

#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindeberg's version of the Central Limit Theorem (CLT), which
#' applies to independent but NOT identically distributed random variables.
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
  ### The following calculations are based on the following setup for each category given in 'df':
  ### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
  ### where X is a 0/1 variable indicating absence or presence of a selected category.
  ### From the X(i) samples we would like to estimate the
  ### true proportion p of the presence of the category in the population.
  ### Therefore X(i) are iid random variables with Binomial(1,p) distribution
  ###
  ### 2) Let Y(i) = w(i)*X(i)
  ### where w(i) is the sampling weight applied to variable X(i).
  ###
  ### We apply the CLT to the sum of the Y(i)'s, using:
  ### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
  ### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
  ###
  ### Hence, by CLT:
  ###   Sum{Y(i) - mu(i)} / sigma -> N(0,1)
  ### where:
  ###   sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
  ### and note that:
  ###   Sum{ mu(i) } = Sum{ w(i) } * p = n*p
  ### since the sampling weights are assumed to sum up to the sample size.
  ###
  ### Note: all the Sums are from i = 1, ..., n
  ###
  ### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
  ### in the usual way, by first estimating sigma (replacing p with its estimate).
  ###

  alpha = 1 - ci_level                                         # area outside the confidence band
  z = qnorm(1 - alpha/2)                                       # critical z-quantile from Normal(0,1)

  n = nrow(df)                                                 # Sample size (assuming no missing values)
  ws = df$Weight / sum(df$Weight) * n                          # Weights scaled to sum the sample size (assumed for sampling weights)
  S = aggregate(ws ~ Category, sum, data=df)                   # Weight-based estimate of the total by category (Sum{ Y(i) })
  sigma2 = sum( ws^2 )                                         # Sum of squared weights (note that we must NOT sum by category)
  S[,"p"] = S[,"ws"] / n                                       # Estimated proportion by category
  S[,"Delta"] =  z * sqrt( sigma2 ) *
                     sqrt( S$p * (1 - S$p) ) / n               # Semi-size of the CI by category
  LB_name = paste(formatC(alpha/2*100, format="g"), "%")       # Name for the CI's Lower Bound column
  UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
  S[,LB_name] = S[,"p"] - S[,"Delta"]                          # CI's Lower Bound
  S[,UB_name] = S[,"p"] + S[,"Delta"]                          # CI's Upper Bound

  return(S)
}

#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
  op = options(digits=digits)
  print(values)
  options(op)
}

2) Simulated data

### Simulated data
set.seed(100)
df <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
  Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)

3) Survey-based CIs (for reference)

# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)

which gives:

               p  Delta  2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628  0.116
CategoryB 0.2119 0.0417 0.1702  0.254
CategoryC 0.2824 0.0445 0.2379  0.327
CategoryD 0.4161 0.0497 0.3664  0.466

4) CLT-based CIs

A description of the method used is included in the comments at the top of the ci_clt() function defined above.

# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)

which gives:

  Category    ws      p  Delta 2.5 % 97.5 %
1        A  44.8 0.0896 0.0286 0.061  0.118
2        B 106.0 0.2119 0.0410 0.171  0.253
3        C 141.2 0.2824 0.0451 0.237  0.328
4        D 208.0 0.4161 0.0494 0.367  0.465
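
To make the formula described in the ci_clt() comments concrete, the "p" and "Delta" entries of a single row (category "A" is chosen here purely for illustration) can be recomputed by hand:

# Hand computation of p and Delta for category "A" (95% CI), following the CLT formula above
n  <- nrow(df)
w  <- df$Weight / sum(df$Weight) * n   # weights rescaled to sum to the sample size
pA <- sum(w[df$Category == "A"]) / n   # weighted proportion of category "A"
DeltaA <- qnorm(0.975) * sqrt(sum(w^2)) * sqrt(pA * (1 - pA)) / n
c(p = pA, Delta = DeltaA)              # should reproduce the "A" row of CI_clt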

5) Comparison of the CI sizes

Here we compute the ratio between the CLT-based CIs and the survey-based CIs.

# Comparison of CI size
show_values(
  data.frame(Category=CI_clt[,"Category"],
             ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)

which gives:

  Category ratio_DeltaCI_clt2survey
1        A                    1.067
2        B                    0.982
3        C                    1.013
4        D                    0.994

Since all the ratios are close to 1, we conclude that the CI sizes produced by the two methods are very similar!

6) Check that the CLT-based implementation of the CIs is correct

A convenient check on the CLT-based calculation of the CIs is to run the calculation for the simple random sampling (SRS) case and verify that the results coincide with those given by the svymean() calculation under an SRS design.

# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1                               # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )

which gives:

  Category   p  Delta  2.5 % 97.5 %
1        A 0.1 0.0263 0.0737  0.126
2        B 0.2 0.0351 0.1649  0.235
3        C 0.3 0.0402 0.2598  0.340
4        D 0.4 0.0429 0.3571  0.443

which we compare with the survey-based calculation:

# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )

which gives:

            p  Delta  2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737  0.126
CategoryB 0.2 0.0351 0.1649  0.235
CategoryC 0.3 0.0402 0.2598  0.340
CategoryD 0.4 0.0430 0.3570  0.443

We see that the results coincide, suggesting that the implementation of the CLT-based calculation of the CIs in the general case of unequal weights is correct.