如何用 R 确定加权总体比例的置信区间
How to Determine the Confidence Interval for a WEIGHTED Population Proportion with R
任务:计算分类变量表现的加权份额,并找出这些加权份额的置信区间
library(dplyr)
set.seed(100)
用分类变量和权重变量组成数据集:
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
快速浏览一下数据
head(df, 10)
tail(df, 10)
现在,我可以在不考虑权重的情况下完成任务了:
编写函数 returns 分类变量的一种表现形式的未加权份额及其 95% 置信区间
(有关确定总体比例的置信区间的一般信息,请参见例如:
https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/)
ci.share <- function(category, manifestation){
n = length(category)
share = length(which(category == manifestation)) / n
se = sqrt (share*(1-share) / n)
if(n*share*(1-share) >= 9){
U <- share - 1.96*se
O <- share + 1.96*se
KI <- c(U, share, O)
names(KI) <- c("lower boundary", "share", "upper boundary")
KI = KI * 100
return(KI)
} else {return("Error, because n*p*(1-p) < 9")}
}
对所有表现形式使用函数并将结果存储在列表中:
cis <- list()
for(i in c("A", "B", "C", "D")){
cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}
使结果易于阅读:
cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis #TASK DONE
问题是:如何获得 "cis" 考虑权重
的等价物
我能做的是找到加权份额:
ws <- summarise_at(group_by(df, Category), vars(Weight), funs(sum))
ws[,2] <- (ws[,2]/sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws
但是现在如何获得置信区间?
非常感谢您提供解决方案。提前致谢!
安迪
我终于找到了一个非常简单的解决方案来解决我的问题。包 "survey" 正是我想要的:
set.seed(100)
library(survey)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
d <- svydesign(id = ~1, weights = ~Weight, data = df)
svymean(~Category, d)
代码生成加权比例(与 "ws" 中的结果相同)和相应的标准误差,这使得计算置信区间变得容易。
为了与基于调查的比例置信区间计算 (CI) 进行比较,我在这里 post 使用中心极限定理计算 CI 的代码(CLT) 对于非相同分布的变量(根据原始发帖者 (OP) 在其评论中的要求):
1) 辅助函数定义
#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
design.mean = svymean(~Category, design, deff="replace")
CI = confint( design.mean )
# Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )
return(as.data.frame(CI))
}
#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindenberg's version of the Central Limit Theorem (CLT) which
#' applies to independent but NOT identically distributed random variables
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
### The following calculations are based on the following setup for each category given in 'df':
### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
### where X is a 0/1 variable indicating absence or presence of a selected category.
### From the X(i) samples we would like to estimate the
### true proportion p of the presence of the category in the population.
### Therefore X(i) are iid random variables with Binomial(1,p) distribution
###
### 2) Let Y(i) = w(i)*X(i)
### where w(i) is the sampling weight applied to variable X(i).
###
### We apply the CLT to the sum of the Y(i)'s, using:
### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
###
### Hence, by CLT:
### Sum{Y(i) - mu(i)} / sigma -> N(0,1)
### where:
### sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
### and note that:
### Sum{ mu(i) } = Sum{ w(i) } * p = n*p
### since the sampling weights are assumed to sum up to the sample size.
###
### Note: all the Sums are from i = 1, ..., n
###
### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
### in the usual way, by first estimating sigma replacing p for the estimated p.
###
alpha = 1 - ci_level # area outside the confidence band
z = qnorm(1 - alpha/2) # critical z-quantile from Normal(0,1)
n = nrow(df) # Sample size (assuming no missing values)
ws = df$Weight / sum(df$Weight) * n # Weights scaled to sum the sample size (assumed for sampling weights)
S = aggregate(ws ~ Category, sum, data=df) # Weighted-base estimate of the total by category (Sum{ Y(i) })
sigma2 = sum( ws^2 ) # Sum of squared weights (note that we must NOT sum by category)
S[,"p"] = S[,"ws"] / n # Estimated proportion by category
S[,"Delta"] = z * sqrt( sigma2 ) *
sqrt( S$p * (1 - S$p) ) / n # Semi-size of the CI by category
LB_name = paste(formatC(alpha/2*100, format="g"), "%") # Name for the CI's Lower Bound column
UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
S[,LB_name] = S[,"p"] - S[,"Delta"] # CI's Lower Bound
S[,UB_name] = S[,"p"] + S[,"Delta"] # CI's Upper Bound
return(S)
}
#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
op = options(digits=digits)
print(values)
options(op)
}
2) 模拟数据
### Simulated data
set.seed(100)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
3) 基于调查CI(供参考)
# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)
给出:
p Delta 2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628 0.116
CategoryB 0.2119 0.0417 0.1702 0.254
CategoryC 0.2824 0.0445 0.2379 0.327
CategoryD 0.4161 0.0497 0.3664 0.466
4) 基于 CLT CI
所用方法的描述包含在上面定义的 ci_clt()
函数顶部的注释中。
# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)
给出:
Category ws p Delta 2.5 % 97.5 %
1 A 44.8 0.0896 0.0286 0.061 0.118
2 B 106.0 0.2119 0.0410 0.171 0.253
3 C 141.2 0.2824 0.0451 0.237 0.328
4 D 208.0 0.4161 0.0494 0.367 0.465
5) CI 尺寸比较
这里我们计算基于 CLT 的 CIs 和基于调查的 CIs 之间的比率。
# Comparison of CI size
show_values(
data.frame(Category=CI_clt[,"Category"],
ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)
给出:
Category ratio_DeltaCI_clt2survey
1 A 1.067
2 B 0.982
3 C 1.013
4 D 0.994
由于所有比率都接近于 1,我们得出结论,这两种方法的 CI 大小非常相似!
6) 检查 CI 基于 CLT 的实现是否正确
对 CIs 的基于 CLT 的计算执行的一个方便的检查是 运行 简单随机样本 (SRS) 案例的计算,并验证结果是否与那些一致SRS设计下的svymean()
计算给出。
# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1 # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )
给出:
Category p Delta 2.5 % 97.5 %
1 A 0.1 0.0263 0.0737 0.126
2 B 0.2 0.0351 0.1649 0.235
3 C 0.3 0.0402 0.2598 0.340
4 D 0.4 0.0429 0.3571 0.443
我们将其与基于调查的计算进行比较:
# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )
给出:
p Delta 2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737 0.126
CategoryB 0.2 0.0351 0.1649 0.235
CategoryC 0.3 0.0402 0.2598 0.340
CategoryD 0.4 0.0430 0.3570 0.443
我们看到结果一致,表明在一般情况下权重不等的CI的基于CLT的计算的实现是正确的。
任务:计算分类变量表现的加权份额,并找出这些加权份额的置信区间
library(dplyr)
set.seed(100)
用分类变量和权重变量组成数据集:
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
快速浏览一下数据
head(df, 10)
tail(df, 10)
现在,我可以在不考虑权重的情况下完成任务了:
编写函数 returns 分类变量的一种表现形式的未加权份额及其 95% 置信区间 (有关确定总体比例的置信区间的一般信息,请参见例如: https://www.dummies.com/education/math/statistics/how-to-determine-the-confidence-interval-for-a-population-proportion/)
ci.share <- function(category, manifestation){
n = length(category)
share = length(which(category == manifestation)) / n
se = sqrt (share*(1-share) / n)
if(n*share*(1-share) >= 9){
U <- share - 1.96*se
O <- share + 1.96*se
KI <- c(U, share, O)
names(KI) <- c("lower boundary", "share", "upper boundary")
KI = KI * 100
return(KI)
} else {return("Error, because n*p*(1-p) < 9")}
}
对所有表现形式使用函数并将结果存储在列表中:
cis <- list()
for(i in c("A", "B", "C", "D")){
cis[[i]] <- ci.share(category = df$Category, manifestation = i)
}
使结果易于阅读:
cis <- t(sapply(cis, "["))
cis <- round(cis, digits = 2)
cis #TASK DONE
问题是:如何获得 "cis" 考虑权重
的等价物我能做的是找到加权份额:
ws <- summarise_at(group_by(df, Category), vars(Weight), funs(sum))
ws[,2] <- (ws[,2]/sum(df$Weight)) * 100
names(ws) <- c("Category", "Weighted_Share")
sum(ws$Weighted_Share) # Should be 100, is 100
ws
但是现在如何获得置信区间?
非常感谢您提供解决方案。提前致谢! 安迪
我终于找到了一个非常简单的解决方案来解决我的问题。包 "survey" 正是我想要的:
set.seed(100)
library(survey)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
d <- svydesign(id = ~1, weights = ~Weight, data = df)
svymean(~Category, d)
代码生成加权比例(与 "ws" 中的结果相同)和相应的标准误差,这使得计算置信区间变得容易。
为了与基于调查的比例置信区间计算 (CI) 进行比较,我在这里 post 使用中心极限定理计算 CI 的代码(CLT) 对于非相同分布的变量(根据原始发帖者 (OP) 在其评论中的要求):
1) 辅助函数定义
#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
design.mean = svymean(~Category, design, deff="replace")
CI = confint( design.mean )
# Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )
return(as.data.frame(CI))
}
#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindenberg's version of the Central Limit Theorem (CLT) which
#' applies to independent but NOT identically distributed random variables
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
### The following calculations are based on the following setup for each category given in 'df':
### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
### where X is a 0/1 variable indicating absence or presence of a selected category.
### From the X(i) samples we would like to estimate the
### true proportion p of the presence of the category in the population.
### Therefore X(i) are iid random variables with Binomial(1,p) distribution
###
### 2) Let Y(i) = w(i)*X(i)
### where w(i) is the sampling weight applied to variable X(i).
###
### We apply the CLT to the sum of the Y(i)'s, using:
### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
###
### Hence, by CLT:
### Sum{Y(i) - mu(i)} / sigma -> N(0,1)
### where:
### sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
### and note that:
### Sum{ mu(i) } = Sum{ w(i) } * p = n*p
### since the sampling weights are assumed to sum up to the sample size.
###
### Note: all the Sums are from i = 1, ..., n
###
### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
### in the usual way, by first estimating sigma replacing p for the estimated p.
###
alpha = 1 - ci_level # area outside the confidence band
z = qnorm(1 - alpha/2) # critical z-quantile from Normal(0,1)
n = nrow(df) # Sample size (assuming no missing values)
ws = df$Weight / sum(df$Weight) * n # Weights scaled to sum the sample size (assumed for sampling weights)
S = aggregate(ws ~ Category, sum, data=df) # Weighted-base estimate of the total by category (Sum{ Y(i) })
sigma2 = sum( ws^2 ) # Sum of squared weights (note that we must NOT sum by category)
S[,"p"] = S[,"ws"] / n # Estimated proportion by category
S[,"Delta"] = z * sqrt( sigma2 ) *
sqrt( S$p * (1 - S$p) ) / n # Semi-size of the CI by category
LB_name = paste(formatC(alpha/2*100, format="g"), "%") # Name for the CI's Lower Bound column
UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
S[,LB_name] = S[,"p"] - S[,"Delta"] # CI's Lower Bound
S[,UB_name] = S[,"p"] + S[,"Delta"] # CI's Upper Bound
return(S)
}
#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
op = options(digits=digits)
print(values)
options(op)
}
2) 模拟数据
### Simulated data
set.seed(100)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
3) 基于调查CI(供参考)
# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)
给出:
p Delta 2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628 0.116
CategoryB 0.2119 0.0417 0.1702 0.254
CategoryC 0.2824 0.0445 0.2379 0.327
CategoryD 0.4161 0.0497 0.3664 0.466
4) 基于 CLT CI
所用方法的描述包含在上面定义的 ci_clt()
函数顶部的注释中。
# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)
给出:
Category ws p Delta 2.5 % 97.5 %
1 A 44.8 0.0896 0.0286 0.061 0.118
2 B 106.0 0.2119 0.0410 0.171 0.253
3 C 141.2 0.2824 0.0451 0.237 0.328
4 D 208.0 0.4161 0.0494 0.367 0.465
5) CI 尺寸比较
这里我们计算基于 CLT 的 CIs 和基于调查的 CIs 之间的比率。
# Comparison of CI size
show_values(
data.frame(Category=CI_clt[,"Category"],
ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)
给出:
Category ratio_DeltaCI_clt2survey
1 A 1.067
2 B 0.982
3 C 1.013
4 D 0.994
由于所有比率都接近于 1,我们得出结论,这两种方法的 CI 大小非常相似!
6) 检查 CI 基于 CLT 的实现是否正确
对 CIs 的基于 CLT 的计算执行的一个方便的检查是 运行 简单随机样本 (SRS) 案例的计算,并验证结果是否与那些一致SRS设计下的svymean()
计算给出。
# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1 # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )
给出:
Category p Delta 2.5 % 97.5 %
1 A 0.1 0.0263 0.0737 0.126
2 B 0.2 0.0351 0.1649 0.235
3 C 0.3 0.0402 0.2598 0.340
4 D 0.4 0.0429 0.3571 0.443
我们将其与基于调查的计算进行比较:
# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )
给出:
p Delta 2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737 0.126
CategoryB 0.2 0.0351 0.1649 0.235
CategoryC 0.3 0.0402 0.2598 0.340
CategoryD 0.4 0.0430 0.3570 0.443
我们看到结果一致,表明在一般情况下权重不等的CI的基于CLT的计算的实现是正确的。