Python: Regression slow compared with Stata (fixed-effect dummies)
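I am trying to run a regression in Python, but it takes so long that I end up stopping it. In Stata the same regression works and takes only a few seconds.
This is due to a categorical column that I include as group fixed effects.
Without that variable, Stata and Python perform comparably, taking about 1 second for 200,000 observations:
Stata code: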
reg income height Number_children
Python code:
import statsmodels.formula.api as smf
model = smf.ols("income ~ height + Number_children", data=humans).fit()
Adding the dummies, I change the Stata code to areg:
areg income height Number_children, absorb(Village)
This takes only 1-2 seconds longer than without the dummies.
In Python:
model = smf.ols("income ~ height + Number_children + Village", data=humans).fit()
where:
Name: Village, dtype: category
Categories (3678, object):
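I stopped the regression after waiting 2 minutes.
Any ideas how to get the code running, and how to bring the speed close to Stata's? Is the problem caused by the variable or by the regression command?
- Edit:
Following Dimitriy's reply, I tried this for all variables, for example: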
humans["income_gr_m"]= humans["income"].groupby(humans['Village']).mean()
humans["income_star"] = humans["income"] - humans["income_gr_m"] + humans["income"].mean()
However, this also kept Python working for at least 2 minutes (I stopped it again). Or should the transformation be done differently? Thanks
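One thing worth checking in the snippet above: `groupby(...).mean()` returns one value per village, indexed by village label, not one value per row. Assigning that result to a column aligns on the index, so rows will generally not receive their own village's mean. A minimal sketch with made-up data (the toy `humans` frame below is an assumption, not the real data) shows `transform("mean")`, which broadcasts the group mean back to every row:

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
humans = pd.DataFrame({
    "Village": pd.Categorical(["a", "a", "b", "b", "b"]),
    "income": [10.0, 20.0, 30.0, 40.0, 50.0],
})

by_village = humans.groupby("Village", observed=True)["income"]

# mean() yields one row per village, indexed by village label.
per_village = by_village.mean()

# transform("mean") returns a Series aligned with the original rows.
humans["income_gr_m"] = by_village.transform("mean")

# Within transformation: subtract the village mean, add back the grand mean.
humans["income_star"] = (
    humans["income"] - humans["income_gr_m"] + humans["income"].mean()
)
print(humans)
```

Whether this accounts for the run time is not established here; it only makes the transformed column match the `_star` definition used in the Stata code.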
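areg is not actually inverting a matrix with the 3,677 village indicators, as you are doing in Python. It transforms the data in a way that makes this unnecessary, so it is much faster. This is also why the constant from regress with explicit dummies will not match the constant from areg, although the slope coefficients should be the same if you wait for Python to finish.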
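To get a feel for the scale involved, here is a rough back-of-the-envelope calculation for 200,000 observations and 3,678 villages. It assumes the formula interface builds a dense float64 design matrix with one column per dummy; that is an assumption about the default behavior, not something measured on the asker's machine.

```python
n_obs = 200_000
n_villages = 3_678

# intercept + height + Number_children + (n_villages - 1) village dummies
n_cols = 3 + (n_villages - 1)

bytes_dense = n_obs * n_cols * 8      # float64 takes 8 bytes per entry
gib_dense = bytes_dense / 2**30

# After demeaning by village only the 3 original columns remain.
bytes_within = n_obs * 3 * 8
mib_within = bytes_within / 2**20

print(f"dummy design matrix: {n_cols} columns, about {gib_dense:.1f} GiB")
print(f"within-transformed design: 3 columns, about {mib_within:.1f} MiB")
```

Under these assumptions the dummy-variable matrix has 3,680 columns and needs roughly 5.5 GiB before any linear algebra starts, while the transformed problem fits in a few MiB.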
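Here is how areg calculates the coefficients, reproduced with regress. The standard errors from regress will be too small, since I have not made the degrees-of-freedom adjustment for the 5 absorbed effects, but I will do that manually below by multiplying the SEs by sqrt(65/61):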
. sysuse auto, clear
(1978 Automobile Data)
. drop if missing(rep78)
(5 observations deleted)
. /* (1) transform the data by subtracting the group specific mean and */
. /* adding the grand/overall mean back in for outcome and regressors */
. foreach var of varlist price weight length foreign {
2. bys rep78: egen group_mean = mean(`var')
3. qui sum `var'
4. gen double `var'_star = `var' - group_mean + r(mean)
5. drop group_mean
6. }
. /* (2) Fit the model on transformed data */
. regress price_star weight_star length_star foreign_star
      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(3, 65)        =     26.99
       Model |   315296838         3   105098946   Prob > F        =    0.0000
    Residual |   253139578        65  3894455.05   R-squared       =    0.5547
-------------+----------------------------------   Adj R-squared   =    0.5341
       Total |   568436416        68  8359359.06   Root MSE        =    1973.4
------------------------------------------------------------------------------
  price_star |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 weight_star |    6.15521   1.008605     6.10   0.000     4.140885    8.169534
 length_star |  -100.9268   33.82508    -2.98   0.004    -168.4801   -33.37341
foreign_star |   3394.052    782.454     4.34   0.000     1831.383     4956.72
       _cons |   5453.782   3829.487     1.42   0.159    -2194.232     13101.8
------------------------------------------------------------------------------
. /* (3) Adjust the SEs for DoF */
. foreach coef in weight_star length_star foreign_star _cons {
2. di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
3. }
Adjusted SE for weight_star: 1.041149
Adjusted SE for length_star: 34.91649
Adjusted SE for foreign_star: 807.7009
Adjusted SE for _cons: 3953.05
. /* (4) Make sure areg gives the same output */
. areg price weight length foreign, absorb(rep78)
Linear regression, absorbing indicators         Number of obs     =         69
                                                F(3, 61)          =      25.33
                                                Prob > F          =     0.0000
                                                R-squared         =     0.5611
                                                Adj R-squared     =     0.5108
                                                Root MSE          =  2037.1129
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |    6.15521   1.041149     5.91   0.000     4.073303    8.237116
      length |  -100.9268   34.91649    -2.89   0.005    -170.7466   -31.10692
     foreign |   3394.052   807.7009     4.20   0.000     1778.954    5009.149
       _cons |   5453.782    3953.05     1.38   0.173    -2450.831    13358.39
-------------+----------------------------------------------------------------
       rep78 |     F(4, 61) =      0.261   0.902           (5 categories)
Stata code:
cls
sysuse auto, clear
drop if missing(rep78)
/* (1) transform the data by subtracting the group specific mean and */
/* adding the grand/overall mean back in for outcome and regressors */
foreach var of varlist price weight length foreign {
bys rep78: egen group_mean = mean(`var')
qui sum `var'
gen double `var'_star = `var' - group_mean + r(mean)
drop group_mean
}
/* (2) Fit the model on transformed data */
regress price_star weight_star length_star foreign_star
/* (3) Adjust the SEs for DoF */
foreach coef in weight_star length_star foreign_star _cons {
di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
}
/* (4) Make sure areg gives the same output */
areg price weight length foreign, absorb(rep78)
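The same four steps can be sketched in Python. This is an illustration under stated assumptions, not the asker's actual pipeline: the helper name `within_ols` and the simulated `humans` data are hypothetical, and the fit uses `numpy.linalg.lstsq` in place of `smf.ols` to keep the sketch dependency-light. The residual degrees of freedom are reduced by the number of groups minus one, mirroring the sqrt(65/61) adjustment above.

```python
import numpy as np
import pandas as pd


def within_ols(df, y, xs, group):
    """OLS on group-demeaned data with DoF-adjusted standard errors."""
    xs = list(xs)
    cols = [y] + xs
    # (1) Subtract group means and add the grand means back in.
    group_means = df.groupby(group, observed=True)[cols].transform("mean")
    star = df[cols] - group_means + df[cols].mean()

    # (2) Fit the model on the transformed data (constant + regressors).
    X = np.column_stack([np.ones(len(star)), star[xs].to_numpy()])
    yv = star[y].to_numpy()
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)

    # (3) Residual DoF also lose (number of groups - 1) absorbed effects.
    resid = yv - X @ beta
    n, k = X.shape
    dof = n - k - (df[group].nunique() - 1)
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return pd.DataFrame({"coef": beta, "se": se}, index=["_cons"] + xs)


# Simulated stand-in for the real data (all values are made up).
rng = np.random.default_rng(0)
n, n_villages = 2_000, 50
village = rng.integers(0, n_villages, size=n)
village_effect = rng.normal(0, 5, size=n_villages)
humans = pd.DataFrame({
    "Village": pd.Categorical(village),
    "height": rng.normal(170, 10, size=n),
    "Number_children": rng.integers(0, 6, size=n).astype(float),
})
humans["income"] = (
    2.0 * humans["height"]
    - 3.0 * humans["Number_children"]
    + village_effect[village]
    + rng.normal(0, 1, size=n)
)

print(within_ols(humans, "income", ["height", "Number_children"], "Village"))
```

The transformed columns could equally be stored as `_star` columns and passed to `smf.ols`; in that case the reported standard errors would still need the rescaling from step (3), because the regression on transformed data does not know about the absorbed village effects.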