How to statistically compare the intercept and slope of two different linear regression models in Python?
I have the following two series of data. I want to fit one OLS linear regression model to df1 and another OLS linear regression model to df2, and then statistically test whether the y-intercepts of the two linear regression models are significantly different (p < 0.05), and likewise whether their slopes are significantly different (p < 0.05). Here is what I have done so far:
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
fig, ax = plt.subplots()
df1.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
df2.plot(figsize=(20, 10), linewidth=5, fontsize=18, ax=ax, kind='line')
plt.show()
model1 = sm.OLS(df1, df1.index)
model2 = sm.OLS(df2, df2.index)
results1 = model1.fit()
results2 = model2.fit()
print(results1.summary())
print(results2.summary())
Results #1
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.625
Model: OLS Adj. R-squared (uncentered): 0.563
Method: Least Squares F-statistic: 10.02
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0194
Time: 20:34:34 Log-Likelihood: -29.262
No. Observations: 7 AIC: 60.52
Df Residuals: 6 BIC: 60.47
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 5.6703 1.791 3.165 0.019 1.287 10.054
==============================================================================
Omnibus: nan Durbin-Watson: 2.956
Prob(Omnibus): nan Jarque-Bera (JB): 0.769
Skew: 0.811 Prob(JB): 0.681
Kurtosis: 2.943 Cond. No. 1.00
==============================================================================
Results #2
OLS Regression Results
=======================================================================================
Dep. Variable: 0 R-squared (uncentered): 0.692
Model: OLS Adj. R-squared (uncentered): 0.641
Method: Least Squares F-statistic: 13.50
Date: Mon, 01 Mar 2021 Prob (F-statistic): 0.0104
Time: 20:39:14 Log-Likelihood: -5.8073
No. Observations: 7 AIC: 13.61
Df Residuals: 6 BIC: 13.56
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.2308 0.063 3.674 0.010 0.077 0.384
==============================================================================
Omnibus: nan Durbin-Watson: 0.148
Prob(Omnibus): nan Jarque-Bera (JB): 0.456
Skew: 0.000 Prob(JB): 0.796
Kurtosis: 1.750 Cond. No. 1.00
==============================================================================
This is as far as I have got, but I think something is wrong. Neither regression result seems to show a y-intercept. Also, I expected the coef in Results #2 to be 0, because the slope should be 0 when all the values are 1, but it comes out as 0.2308. Any advice or guidance material would be greatly appreciated.
In statsmodels, an OLS model does not fit an intercept by default (see the docs).
exog array_like
A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant.
The documentation for the exog argument of the OLS constructor suggests using this feature of the tools module to add an intercept to the data.
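As a minimal sketch of my own (not part of the original question or the linked docs), here is how the second model from the question could be refit with an intercept using add_constant; with the intercept in place, the slope on the all-ones series comes out as 0, which is what was expected:

import pandas as pd
import statsmodels.api as sm
from statsmodels.tools import add_constant

data2 = [1, 1, 1, 1, 1, 1, 1]
y = pd.Series(data2, name='y')
X = add_constant(pd.Series(y.index, name='x')) ## prepend a column of ones to the regressor
model = sm.OLS(y, X).fit()
print(model.params) ## const is ~1.0 and the slope on x is ~0.0 once an intercept is fitted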
For performing hypothesis tests on the coefficient values, this question provides some guidance. Unfortunately, that approach is only valid when the residuals have the same variance.
We can start by checking whether the residuals of the two fits have the same variance (using Levene's test), ignoring the regression coefficients for the moment.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
np.inf == float('inf')
data1 = [1, 3, 45, 0, 25, 13, 43]
data2 = [1, 1, 1, 1, 1, 1, 1]
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## this does nothing because the y column is already a constant
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
The output of the Levene test looks like this:
LeveneResult(statistic=7.317386741297884, pvalue=0.019129208414097015)
So we can reject the null hypothesis that the residual distributions have the same variance at alpha = 0.05.
There is now no point in testing the linear regression coefficients, because the residuals do not have the same distribution. It is important to remember that in a regression problem it makes no sense to compare regression coefficients independently of the data they were fitted on; the distribution of the regression coefficients depends on the distribution of the data.
Let's see what happens anyway when we try the suggested test. Combining the instructions above with this method from the OLS package produces the following code:
## stack the data and add the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first group of points
df_all = pd.concat([df1, df2], ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator column after stacking
df_all['int'] = df_all['x'] * df_all['c'] ## construct the interaction column
print(df_all) ## look at the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The result of the F-test looks like this:
<F test: F=array([[4.01995453]]), p=0.05233934453138028, df_denom=10, df_num=2>
The result of the F-test means that we just barely fail to reject either of the null hypotheses specified in the hypotheses variable, namely that the coefficient of the indicator variable 'c' and the coefficient of the interaction term 'int' are zero.
It is clear from this example that the F-test on the regression coefficients is not very powerful if the residuals do not have the same variance.
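One possible mitigation, which is not part of the approach above and is offered only as a sketch, is to refit the stacked model with heteroskedasticity-robust (HC3) standard errors via the cov_type argument of fit, so that the F-test uses the robust covariance; it reuses df_all, formula and hypotheses from the block above.

## sketch: the same interaction F-test, but with an HC3 (heteroskedasticity-robust) covariance estimate
robust_result = ols(formula, df_all).fit(cov_type='HC3')
print(robust_result.f_test(hypotheses))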
Note that with so few points in the given example it is hard for statistical tests to distinguish the two cases cleanly, even though to the human eye they look very different. This is because, even though statistical tests are designed to make few assumptions about the data, those assumptions hold up better the more data you have. When testing statistical methods to see whether they behave as you expect, it is often best to start by constructing large, low-noise samples and then watch how well the methods perform as the data set gets smaller and noisier.
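To make that concrete, here is a small simulation sketch of my own (not from the original answer): two groups with genuinely different slopes are generated at several sample sizes and the same stacked-model F-test is run on each, so the p-value for the difference should shrink as the sample grows. The helper name, the slopes and the noise level are arbitrary choices made purely for illustration.

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def slope_difference_pvalue(n_samples, noise_scale=5.0, seed=0):
    ## simulate two groups with different true slopes, stack them with an
    ## indicator 'c' and an interaction 'int', and return the F-test p-value
    rng = np.random.default_rng(seed)
    x = np.arange(n_samples, dtype=float)
    y1 = 2.0 * x + rng.normal(scale=noise_scale, size=n_samples) ## true slope 2
    y2 = 0.5 * x + rng.normal(scale=noise_scale, size=n_samples) ## true slope 0.5
    df = pd.DataFrame({'x': np.tile(x, 2),
                       'y': np.concatenate([y1, y2]),
                       'c': np.repeat([1.0, 0.0], n_samples)})
    df['int'] = df['x'] * df['c']
    result = ols('y ~ x + c + int', df).fit()
    return result.f_test('(c = 0), (int = 0)').pvalue

for n in (5, 10, 50, 200):
    print(n, slope_difference_pvalue(n))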
For completeness, I will construct an example where the Levene test fails to distinguish the two regression models but the F-test succeeds. The idea is to compare the regression of a noisy data set with the regression of its reverse. The distributions of the residuals will be the same, but the relationships between the variables will be very different. Note that this would not work with the reverse of the noisy data set given in the previous example, because that data is so noisy that the F-test cannot distinguish a positive slope from a negative one.
import numpy as np
import pandas as pd
from scipy.stats import levene
from statsmodels.tools import add_constant
from statsmodels.formula.api import ols ## use formula api to make the tests easier
n_samples = 6
noise = np.random.randn(n_samples) * 5
data1 = np.linspace(0, 30, n_samples) + noise
data2 = data1[::-1] ## reverse the time series
df1 = add_constant(pd.DataFrame(data1)) ## add a constant column so we fit an intercept
df1 = df1.reset_index() ## just doing this to make the index a column of the data frame
df1 = df1.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df2 = add_constant(pd.DataFrame(data2)) ## add a constant column so we fit an intercept
df2 = df2.reset_index()
df2 = df2.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
formula1 = 'y ~ x + const' ## define formulae
formula2 = 'y ~ x'
model1 = ols(formula1, df1).fit()
model2 = ols(formula2, df2).fit()
print(levene(model1.resid, model2.resid))
## stack the data and add the indicator variable as described in:
## stackexchange question:
df1['c'] = 1 ## add indicator variable that tags the first group of points
df_all = pd.concat([df1, df2], ignore_index=True).drop('const', axis=1)
df_all = df_all.rename(columns={'index':'x', 0:'y'}) ## the old index will now be called x and the old values are now y
df_all = df_all.fillna(0) ## a bunch of the values are missing in the indicator column after stacking
df_all['int'] = df_all['x'] * df_all['c'] ## construct the interaction column
print(df_all) ## look at the data
formula = 'y ~ x + c + int' ## define the linear model using the formula api
result = ols(formula, df_all).fit()
hypotheses = '(c = 0), (int = 0)'
f_test = result.f_test(hypotheses)
print(f_test)
The results of the Levene test and the F-test look like this:
LeveneResult(statistic=5.451203655948632e-31, pvalue=1.0)
<F test: F=array([[10.62788052]]), p=0.005591319998324387, df_denom=8, df_num=2>
One final point: because we are making multiple comparisons on this data and stopping as soon as we get a significant result (that is, if the Levene test rejects the null we stop, and only if it does not do we go on to the F-test), this is a stepwise hypothesis test and we are actually inflating our false-positive rate. We should correct our p-values for multiple comparisons before reporting the results. Note that the F-test already does this for the hypotheses about the regression coefficients that we test within it. I am a little hazy on the underlying assumptions of these testing procedures, so I am not 100% sure the correction below is the best one for you, but keep it in mind in case you feel you are getting false positives too often.
from statsmodels.sandbox.stats.multicomp import multipletests
print(multipletests([1, .005591], .05)) ## correct our p-values given that we made two comparisons
The output looks like this:
(array([False, True]), array([1. , 0.01115074]), 0.025320565519103666, 0.025)
This means that under the correction we still reject the second null hypothesis, and the corrected p-values look like [1., 0.011150]. The last two values are the corrected significance levels under two different correction methods.
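As a small aside, the same function is also available outside the sandbox, in statsmodels.stats.multitest, and its four return values can be unpacked by name; the sketch below is just an alternative spelling of the call above (the default correction method is Holm-Sidak).

from statsmodels.stats.multitest import multipletests

## same correction as above, with the return values unpacked by name
reject, pvals_corrected, alpha_sidak, alpha_bonf = multipletests([1, .005591], alpha=.05)
print(reject) ## which null hypotheses are rejected after correction
print(pvals_corrected) ## the corrected p-values
print(alpha_sidak, alpha_bonf) ## corrected significance levels (Sidak and Bonferroni)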
I hope this helps anyone trying to do this kind of work. If anyone has anything to add, I welcome comments. This is not my area of expertise, so I could be making some mistakes.