How do I compute the confidence interval of a Difference in Differences estimate using Python?
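I am trying to analyze the total active minutes per user before and after an experiment. Here I include the relevant user data before and after the experiment: variant_number = 0 denotes the control group and 1 denotes the treatment group. Specifically, I am interested in the mean (average total active minutes per user).
First, I calculated the before/after difference for the treatment group and the before/after difference for the control group (-183.7 and 19.4 respectively). The difference in differences is 19.4 - (-183.7) = 203.1 in this case.
How can I construct a 95% confidence interval for the difference in differences using Python? (I can provide more code/context if needed.)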
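You can fit a linear model and look at the interaction term (group[T.1]:period[T.pre] in the output below), whose coefficient is the difference in differences. For this simulated data the estimate is -223.1779, the interaction is highly significant (p < 5e-4), and the 95% confidence interval is [-276.360, -169.995].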
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

np.random.seed(14)
# Simulate minutes for each group/period cell (0 = control, 1 = treatment)
minutes_0_pre = np.random.normal(loc=478, scale=1821, size=39776)
minutes_1_pre = np.random.normal(loc=275, scale=1078, size=9921)
minutes_0_post = np.random.normal(loc=458, scale=1653, size=37425)
minutes_1_post = np.random.normal(loc=458, scale=1681, size=9208)

# Long format: one row per observation, labelled with its group and period
df = pd.DataFrame({'minutes': np.concatenate((minutes_0_pre, minutes_1_pre, minutes_0_post, minutes_1_post)),
                   'group': np.concatenate((np.repeat(a='0', repeats=minutes_0_pre.size),
                                            np.repeat(a='1', repeats=minutes_1_pre.size),
                                            np.repeat(a='0', repeats=minutes_0_post.size),
                                            np.repeat(a='1', repeats=minutes_1_post.size))),
                   'period': np.concatenate((np.repeat(a='pre', repeats=minutes_0_pre.size + minutes_1_pre.size),
                                             np.repeat(a='post', repeats=minutes_0_post.size + minutes_1_post.size)))
                   })

# 'group * period' expands to both main effects plus their interaction
model = smf.glm('minutes ~ group * period', df, family=sm.families.Gaussian()).fit()
print(model.summary())
Output:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                minutes   No. Observations:                96330
Model:                            GLM   Df Residuals:                    96326
Model Family:                Gaussian   Df Model:                            3
Link Function:               identity   Scale:                      2.8182e+06
Method:                          IRLS   Log-Likelihood:            -8.5201e+05
Date:                Mon, 18 Jan 2021   Deviance:                   2.7147e+11
Time:                        23:05:53   Pearson chi2:                 2.71e+11
No. Iterations:                     3
Covariance Type:            nonrobust
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                  456.2792      8.678     52.581      0.000     439.271     473.287
group[T.1]                  14.9314     19.529      0.765      0.445     -23.344      53.207
period[T.pre]               21.7417     12.089      1.798      0.072      -1.953      45.437
group[T.1]:period[T.pre]  -223.1779     27.134     -8.225      0.000    -276.360    -169.995
============================================================================================
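Edit:
Since your summary statistics show that your distribution is heavily skewed, bootstrapping is actually a more reliable way to estimate the confidence interval: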
r = 1000  # number of bootstrap replicates
bootstrap = np.zeros(r)
for i in range(0, r):
    # Resample rows with replacement and refit the model on the resample
    sample_index = np.random.choice(a=range(0, df.shape[0]), size=df.shape[0], replace=True)
    df_sample = df.iloc[sample_index]
    model = smf.glm('minutes ~ group * period', df_sample, family=sm.families.Gaussian()).fit()
    bootstrap[i] = model.params.iloc[3]  # interaction
bootstrap = pd.DataFrame(bootstrap, columns=['interaction'])
# Percentile interval: 2.5% and 97.5% quantiles of the bootstrap estimates
print(bootstrap.quantile([0.025, 0.975]).T)
Output:
                  0.025       0.975
interaction -273.524899 -175.373177
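Refitting the model 1000 times can be slow. Because the interaction coefficient in this model is just a combination of the four cell means, a lighter variation is to bootstrap those means directly. The sketch below is an alternative to the loop above, not the same procedure: it resamples within each group/period cell (a stratified bootstrap) instead of resampling pooled rows, and the function name bootstrap_did_ci, its parameters, and the exponential demo data are all hypothetical.

```python
import numpy as np

def bootstrap_did_ci(ctrl_pre, ctrl_post, trt_pre, trt_post,
                     n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (trt_pre - trt_post) - (ctrl_pre - ctrl_post)."""
    rng = np.random.default_rng(seed)
    cells = [np.asarray(c, dtype=float)
             for c in (ctrl_pre, ctrl_post, trt_pre, trt_post)]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each cell with replacement, keeping its original size
        m_ctrl_pre, m_ctrl_post, m_trt_pre, m_trt_post = (
            rng.choice(c, size=c.size, replace=True).mean() for c in cells)
        estimates[b] = (m_trt_pre - m_trt_post) - (m_ctrl_pre - m_ctrl_post)
    return np.quantile(estimates, [alpha / 2, 1 - alpha / 2])

# Hypothetical skewed demo data (exponential), only to exercise the function
demo_rng = np.random.default_rng(1)
low, high = bootstrap_did_ci(
    ctrl_pre=demo_rng.exponential(480, size=2000),
    ctrl_post=demo_rng.exponential(460, size=2000),
    trt_pre=demo_rng.exponential(275, size=2000),
    trt_post=demo_rng.exponential(460, size=2000),
    n_boot=500,
)
print(f"95% bootstrap CI: [{low:.1f}, {high:.1f}]")
```

With the data frame from the answer you would pass, for example, df.loc[(df.group == '0') & (df.period == 'pre'), 'minutes'] for ctrl_pre, and likewise for the other three cells. The sign convention (pre minus post) matches the interaction term above.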