Python: Regression slow compared with Stata (fixed-effect dummies)

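I am trying to run a regression in Python, but it runs for so long that I end up stopping it. In Stata the same regression works and takes only a few seconds.

The cause is a categorical column that I include as group fixed effects. Without that variable, Stata and Python perform comparably, taking about 1 second for 200,000 observations:

Stata code: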

reg income height Number_children

Python code:

import statsmodels.formula.api as smf
model = smf.ols("income ~ height + Number_children", data=humans).fit()

To add the dummies, I changed the Stata code to areg:

areg income height Number_children, absorb(Village)

This takes only 1-2 seconds longer than without the dummies.

In Python:

model = smf.ols("income ~ height + Number_children + Village", data=humans).fit()

where:

Name: Village, dtype: category
Categories (3678, object):

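I stopped the regression after waiting 2 minutes. Any ideas on how to get the code running, and nearly as fast as in Stata? Is the problem caused by the variable or by the regression command?

Following Dimitriy's reply, I tried this for all variables:

For example: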

humans["income_gr_m"]= humans["income"].groupby(humans['Village']).mean()
humans["income_star"] = humans["income"] - humans["income_gr_m"] + humans["income"].mean()

However, this also keeps Python working for at least 2 minutes (I stopped it again). Or should the transformation be done differently? Thanks.

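areg is not actually inverting a matrix with 3,677 village indicators the way you are doing in Python. It transforms the data in a way that avoids that, so it is much faster. This is also why the constant from regress will not match the constant from areg, though the slope coefficients should be identical if you wait for Python to finish.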
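For a rough sense of why the dummy approach is expensive, here is a back-of-the-envelope calculation. It assumes the formula interface expands Village into a dense float64 design matrix; that is an assumption about the internals, not something measured on the asker's data.

```python
n_obs = 200_000
# 3,677 village indicators plus intercept, height and Number_children
n_cols = 3_677 + 3
# Assumption: dense storage, 8 bytes per float64 entry
bytes_needed = n_obs * n_cols * 8
print(f"{bytes_needed / 1e9:.1f} GB")  # prints "5.9 GB"
```

That is only the storage for the matrix itself, before any least-squares work on it, which is consistent with the regression not finishing in a couple of minutes.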

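Here is how to compute the areg coefficients with regress. The standard errors will be too small, since I am not making the degrees-of-freedom adjustment for the 5 absorbed effects, but I adjust them manually below by multiplying the SEs by sqrt(65/61):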

. sysuse auto, clear
(1978 Automobile Data)

. drop if missing(rep78)
(5 observations deleted)

. /* (1) transform the data by subtracting the group specific mean and */
. /* adding the grand/overall mean back in for outcome and regressors */
. foreach var of varlist price weight length foreign {
  2.         bys rep78: egen group_mean = mean(`var')
  3.         qui sum `var'
  4.         gen double `var'_star = `var' - group_mean + r(mean)
  5.         drop group_mean
  6. }

. /* (2) Fit the model on transformed data */
. regress price_star weight_star length_star foreign_star

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(3, 65)        =     26.99
       Model |   315296838         3   105098946   Prob > F        =    0.0000
    Residual |   253139578        65  3894455.05   R-squared       =    0.5547
-------------+----------------------------------   Adj R-squared   =    0.5341
       Total |   568436416        68  8359359.06   Root MSE        =    1973.4

------------------------------------------------------------------------------
  price_star |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 weight_star |    6.15521   1.008605     6.10   0.000     4.140885    8.169534
 length_star |  -100.9268   33.82508    -2.98   0.004    -168.4801   -33.37341
foreign_star |   3394.052    782.454     4.34   0.000     1831.383     4956.72
       _cons |   5453.782   3829.487     1.42   0.159    -2194.232     13101.8
------------------------------------------------------------------------------

. /* (3) Adjust the SEs for DoF */
. foreach coef in weight_star length_star foreign_star _cons {
  2.         di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
  3. }
Adjusted SE for weight_star:  1.041149
Adjusted SE for length_star:  34.91649
Adjusted SE for foreign_star:  807.7009
Adjusted SE for _cons:   3953.05

. /* (4) Make sure areg gives the same output */
. areg price weight length foreign, absorb(rep78)

Linear regression, absorbing indicators         Number of obs     =         69
                                                F(   3,     61)   =      25.33
                                                Prob > F          =     0.0000
                                                R-squared         =     0.5611
                                                Adj R-squared     =     0.5108
                                                Root MSE          =  2037.1129

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |    6.15521   1.041149     5.91   0.000     4.073303    8.237116
      length |  -100.9268   34.91649    -2.89   0.005    -170.7466   -31.10692
     foreign |   3394.052   807.7009     4.20   0.000     1778.954    5009.149
       _cons |   5453.782    3953.05     1.38   0.173    -2450.831    13358.39
-------------+----------------------------------------------------------------
       rep78 |          F(4, 61) =      0.261   0.902           (5 categories)

Stata code:

cls
sysuse auto, clear
drop if missing(rep78)
/* (1) transform the data by subtracting the group specific mean and */
/* adding the grand/overall mean back in for outcome and regressors */
foreach var of varlist price weight length foreign {
    bys rep78: egen group_mean = mean(`var')
    qui sum `var'
    gen double `var'_star = `var' - group_mean + r(mean)
    drop group_mean
}
/* (2) Fit the model on transformed data */
regress price_star weight_star length_star foreign_star
/* (3) Adjust the SEs for DoF */
foreach coef in weight_star length_star foreign_star _cons {
    di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
}
/* (4) Make sure areg gives the same output */
areg price weight length foreign, absorb(rep78)
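
The same steps can be sketched in Python. This is a minimal sketch on synthetic data, not the asker's dataset: the column names y, x and g and the helper within_ols are made up for illustration. The important detail is groupby(...).transform("mean"), which returns the group mean for every row, aligned with the original index. By contrast, groupby(...).mean() returns one row per group, so assigning its result to a new column does not line up with the rows of the original frame.

```python
import numpy as np
import pandas as pd


def within_ols(df, y, xs, group):
    """OLS on group-demeaned data, with SEs adjusted for the absorbed effects.

    Hypothetical helper for illustration; not part of pandas or statsmodels.
    """
    cols = [y] + list(xs)
    # (1) Subtract each group's mean and add the overall mean back in.
    # transform("mean") yields one value per row, aligned with df's index.
    star = df[cols] - df.groupby(group)[cols].transform("mean") + df[cols].mean()

    # (2) Fit the model on the transformed data.
    X = np.column_stack([np.ones(len(star)), star[list(xs)].to_numpy()])
    yv = star[y].to_numpy()
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)

    # (3) Degrees of freedom: n - k - (n_groups - 1). The constant already
    # accounts for one group effect, which gives 69 - 4 - 4 = 61 in the
    # auto example above.
    resid = yv - X @ beta
    n, k = X.shape
    dof = n - k - (df[group].nunique() - 1)
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return pd.DataFrame({"coef": beta, "se": se}, index=["_cons"] + list(xs))


# Synthetic example: 50 groups with their own intercepts, true slope 2.0.
rng = np.random.default_rng(0)
n, n_groups = 2_000, 50
g = rng.integers(0, n_groups, size=n)
x = rng.normal(size=n) + 0.5 * g / n_groups
alpha = rng.normal(size=n_groups)
y = 1.0 + 2.0 * x + alpha[g] + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "g": g})

fit = within_ols(df, "y", ["x"], "g")
print(fit)
```

On the asker's data the call would look something like within_ols(humans, "income", ["height", "Number_children"], "Village"). The slope estimates should match those from the regression with explicit dummies, while the constant will differ, for the reason given at the top of this answer.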