95% confidence interval of a single proportion in a pandas time series using statsmodels

Question

I have a time series dataframe:

import pandas as pd

df = pd.DataFrame({'year': ['2010','2011','2012','2013','2014','2015','2016','2017','2018','2019'],
                   'total_count': [545,779,706,547,626,530,766,1235,1260,947],
                   'rand_count': [96,184,148,154,160,149,124,274,322,301],
                   'rand_perc': [17.61,23.62,20.96,28.15,25.56,28.11,16.19,22.19,25.56,31.78]
                   })

where:

df['rand_perc'] = (df['rand_count']/df['total_count'])*100

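Problem:

For each row of df, I want to compute the confidence interval of the single proportion df['rand_count'] out of df['total_count'], and then plot df['year'] against df['rand_perc'] with the CI as error bars. I tried to compute the CI for each row with statsmodels using the following code: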

import statsmodels.api as sm

df['CI'] =  df[['total_count', 'rand_count']].apply(lambda row: sm.stats.proportion_confint(count = 
df['rand_count'], nobs = df['total_count'], alpha = 0.05), axis = 1)

But the resulting df['CI'] looks very ugly: every row holds a tuple containing the CIs of all rows:

0    ([0.14416430990026746, 0.2063732756491498, 0.1...
1    ([0.14416430990026746, 0.2063732756491498, 0.1...
2    ([0.14416430990026746, 0.2063732756491498, 0.1...
3    ([0.14416430990026746, 0.2063732756491498, 0.1...
4    ([0.14416430990026746, 0.2063732756491498, 0.1...
5    ([0.14416430990026746, 0.2063732756491498, 0.1...
6    ([0.14416430990026746, 0.2063732756491498, 0.1...
7    ([0.14416430990026746, 0.2063732756491498, 0.1...
8    ([0.14416430990026746, 0.2063732756491498, 0.1...
9    ([0.14416430990026746, 0.2063732756491498, 0.1...
Name: CI, dtype: object

Desired result

df['CI'] should hold one two-element tuple per row, such as:

(0.144164, 0.206373)
(0.179606, 0.243846)
(0.221421, 0.242859)
...................

I would also like two separate columns, df['upper'] and df['lower'], holding the upper and lower bounds of df['CI'] respectively.

Any help is much appreciated. Thanks!

Answer 1

Consider assigning multiple columns at once. They should line up by index because, according to the docs:

When a pandas object is returned, then the index is taken from the count.

df['lower_CI'], df['upper_CI'] =  sm.stats.proportion_confint(
                                      count = df['rand_count'],
                                      nobs = df['total_count'],
                                      alpha = 0.05
                                  )
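
The attempt in the question returns every interval in every row because the lambda refers to the whole columns (df['rand_count'], df['total_count']) instead of the current row. The sketch below avoids apply altogether: it uses the vectorized call, then zips the two bounds into the per-row tuple column the question asks for. The helper plot_with_ci is a hypothetical name introduced here for illustration, and it assumes matplotlib is installed; the bounds are scaled by 100 only to match the percentage units of df['rand_perc'].

```python
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

df = pd.DataFrame({
    'year': ['2010', '2011', '2012', '2013', '2014',
             '2015', '2016', '2017', '2018', '2019'],
    'total_count': [545, 779, 706, 547, 626, 530, 766, 1235, 1260, 947],
    'rand_count': [96, 184, 148, 154, 160, 149, 124, 274, 322, 301],
})
df['rand_perc'] = df['rand_count'] / df['total_count'] * 100

# Vectorized call: one lower and one upper bound per row
# (default method is the normal approximation).
lower, upper = proportion_confint(count=df['rand_count'],
                                  nobs=df['total_count'],
                                  alpha=0.05)
df['lower_CI'] = lower
df['upper_CI'] = upper

# One (lower, upper) tuple per row, as requested in the question.
df['CI'] = list(zip(df['lower_CI'], df['upper_CI']))


def plot_with_ci(frame):
    """Hypothetical helper: plot rand_perc per year with CI error bars."""
    import matplotlib.pyplot as plt  # assumed to be installed

    # errorbar expects distances from the point, lower distances first.
    yerr = [
        (frame['rand_perc'] - frame['lower_CI'] * 100).to_numpy(),
        (frame['upper_CI'] * 100 - frame['rand_perc']).to_numpy(),
    ]
    fig, ax = plt.subplots()
    ax.errorbar(frame['year'], frame['rand_perc'], yerr=yerr,
                fmt='o-', capsize=3)
    ax.set_xlabel('year')
    ax.set_ylabel('rand_perc (%)')
    return fig, ax
```

Calling plot_with_ci(df) should then draw one point per year with its interval. If apply is preferred, using row['rand_count'] and row['total_count'] inside the lambda gives the same per-row tuples.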

