Python: Do not show dummies in statsmodels summary

I am using statsmodels to create some regression output:

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
import numpy as np 
import pandas as pd 

x1 = pd.Series(np.random.randn(2000))
x2 = pd.Series(np.random.randn(2000))
aa_milne_arr = ['a', 'b', 'c', 'd', "e", "f", "g", "h", "i"]
dummy = pd.Series(np.random.choice(aa_milne_arr, 2000,))
depen = pd.Series(np.random.randn(2000))
df = pd.DataFrame({"y": depen, "x1": x1, "x2": x2, "dummy": dummy})
df['const'] = 1
df['xsqr'] = df['x1']**2  
mod = smf.ols('y ~ x1 + x2 + dummy', data=df)
mod2 = smf.ols('y ~ x1 + x2 + xsqr + dummy', data=df)
res = mod.fit()
res2 = mod2.fit()

print (summary_col([res,res2],stars=True,float_format='%0.3f',
                  model_names=['one\n(0)','two\n(1)'],
                  info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.2f}".format(x.rsquared)}))

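Everything works fine, but my real dataset is large and has many dummies (far more than in this example). I would therefore like to exclude the dummies from the summary output, but not from the regression itself. Is that possible?

A quick-and-dirty way is to find the lines of the final summary_col output that belong to the dummies and skip them when printing: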

summary = summary_col(
    [res,res2],stars=True,float_format='%0.3f',
    model_names=['one\n(0)','two\n(1)'],
    info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
    'R2':lambda x: "{:.2f}".format(x.rsquared)})

# As string
# summary_str = str(summary).split('\n')
# LaTeX format
summary_str = summary.as_latex().split('\n')

# Find dummy indexes
dummy_idx = []
for i, li in enumerate(summary_str):
    if li.startswith('dummy'):
        dummy_idx.append(i)
        dummy_idx.append(i + 1)

# Print summary avoiding dummy indexes
for i, li in enumerate(summary_str):
    if i not in dummy_idx:
        print(li)

It's not pretty, but it works. With the string format:

==========================
             one     two  
             (0)     (1)  
--------------------------
Intercept  0.029   -0.000 
           (0.065) (0.068)
x1         0.023   0.025  
           (0.022) (0.022)
x2         -0.014  -0.014 
           (0.022) (0.022)
xsqr               0.024  
                   (0.016)
N          2000    2000   
R2         0.00    0.00   
==========================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

With the LaTeX format:

\begin{table}
\caption{}
\begin{center}
\begin{tabular}{lcc}
\hline
           &   one   &   two    \\
           &   (0)   &   (1)    \\
\hline
\hline
\end{tabular}
\begin{tabular}{lll}
Intercept  & 0.070   & 0.067    \\
           & (0.069) & (0.071)  \\
x1         & 0.001   & 0.001    \\
           & (0.022) & (0.022)  \\
x2         & -0.024  & -0.025   \\
           & (0.022) & (0.022)  \\
xsqr       &         & 0.003    \\
           &         & (0.015)  \\
N          & 2000    & 2000     \\
R2         & 0.01    & 0.01     \\
\hline
\end{tabular}
\end{center}
\end{table}
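
The filtering loop from the quick-and-dirty approach can be wrapped in a small reusable function. The following is only a minimal sketch and not part of statsmodels: `drop_rows` and its parameters are hypothetical names, and it assumes that every coefficient row is followed by exactly one standard-error row, as in the summary_col output shown above. The sample table is hand-written for illustration.

```python
def drop_rows(text, prefix="dummy", rows_per_term=2):
    """Return `text` without the rows whose label starts with `prefix`.

    Assumes each matching coefficient row is followed by
    `rows_per_term - 1` continuation rows (here: one standard-error row).
    """
    kept = []
    skip = 0  # number of upcoming rows that still have to be dropped
    for line in text.split("\n"):
        if line.startswith(prefix):
            skip = rows_per_term
        if skip > 0:
            skip -= 1
            continue
        kept.append(line)
    return "\n".join(kept)


# Small hand-written table in the same layout as the string output above.
sample = "\n".join([
    "x1          0.023   0.025",
    "           (0.022) (0.022)",
    "dummy[T.b]  0.010   0.011",
    "           (0.090) (0.090)",
    "dummy[T.c] -0.020  -0.021",
    "           (0.091) (0.091)",
    "N          2000    2000",
])
filtered = drop_rows(sample)
print(filtered)
```

With the objects from the answer above, this would be called as `print(drop_rows(str(summary)))` or `print(drop_rows(summary.as_latex()))`.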

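I would use the regressor_order argument of summary_col. It lets you specify which regressors are shown first, and if you also pass drop_omitted=True, every regressor that is not listed in regressor_order is left out of the table entirely.

Example: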

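# Collect every regressor name that appears in either model.
all_regressors = sorted(set(res.model.exog_names) | set(res2.model.exog_names))
# Drop the dummies using some logic on their names. With the formula above they
# are named like 'dummy[T.b]'; with C(dummy) they would start with 'C(' instead.
all_regressors_no_fe = [var_name for var_name in all_regressors
                        if not var_name.startswith(('dummy[', 'C('))]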

print (summary_col([res,res2],stars=True,float_format='%0.3f',
                  model_names=['one\n(0)','two\n(1)'],
                  info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.2f}".format(x.rsquared)},
                  regressor_order=all_regressors_no_fe,
                  drop_omitted=True))
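
If the dummies come from more than one categorical variable, the name filtering can be factored out into a helper. The sketch below is an illustration only: `non_dummy_regressors` is a hypothetical function, and it assumes that the formula interface names dummy columns like `dummy[T.b]` or `C(dummy)[T.b]`. Unlike the sorted list above, it keeps the regressors in order of first appearance. The name lists are hand-written stand-ins for `res.model.exog_names` and `res2.model.exog_names`.

```python
def non_dummy_regressors(name_lists, dummy_vars):
    """Collect regressor names across models, skipping dummy columns.

    name_lists: one list of regressor names per model,
                e.g. [res.model.exog_names, res2.model.exog_names]
    dummy_vars: names of the categorical variables to hide, e.g. ['dummy']
    """
    # Prefixes assumed to mark dummy columns created by the formula interface.
    prefixes = tuple(f"{v}[" for v in dummy_vars) + tuple(f"C({v}" for v in dummy_vars)
    ordered = []
    for names in name_lists:
        for name in names:
            if name.startswith(prefixes):
                continue  # dummy column: leave it out of the table
            if name not in ordered:
                ordered.append(name)
    return ordered


# Hand-written name lists standing in for the two fitted models.
names_one = ["Intercept", "dummy[T.b]", "dummy[T.c]", "x1", "x2"]
names_two = ["Intercept", "dummy[T.b]", "dummy[T.c]", "x1", "x2", "xsqr"]
keep = non_dummy_regressors([names_one, names_two], ["dummy"])
print(keep)
```

The resulting list can then be passed to summary_col as `regressor_order=keep` together with `drop_omitted=True`.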