Python: Do not show dummies in statsmodels summary
I am using statsmodels to create some regression output:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
import numpy as np
import pandas as pd
x1 = pd.Series(np.random.randn(2000))
x2 = pd.Series(np.random.randn(2000))
aa_milne_arr = ['a', 'b', 'c', 'd', "e", "f", "g", "h", "i"]
dummy = pd.Series(np.random.choice(aa_milne_arr, 2000,))
depen = pd.Series(np.random.randn(2000))
df = pd.DataFrame({"y": depen, "x1": x1, "x2": x2, "dummy": dummy})
df['const'] = 1
df['xsqr'] = df['x1']**2
mod = smf.ols('y ~ x1 + x2 + dummy', data=df)
mod2 = smf.ols('y ~ x1 + x2 + xsqr + dummy', data=df)
res = mod.fit()
res2 = mod2.fit()
print(summary_col([res, res2], stars=True, float_format='%0.3f',
                  model_names=['one\n(0)', 'two\n(1)'],
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
                             'R2': lambda x: "{:.2f}".format(x.rsquared)}))
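Everything works, but I have a large dataset with many dummy variables (far more than in the example). I would therefore like to exclude the dummies from the summary output, but not from the regression itself. Is that possible?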
A quick and dirty way is to find the lines for the dummy coefficients in the final summary_col output and then skip them when printing:
summary = summary_col(
    [res, res2], stars=True, float_format='%0.3f',
    model_names=['one\n(0)', 'two\n(1)'],
    info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
               'R2': lambda x: "{:.2f}".format(x.rsquared)})
# As string
# summary_str = str(summary).split('\n')
# LaTeX format
summary_str = summary.as_latex().split('\n')
# Find dummy indexes: each coefficient line is followed by its standard-error line
dummy_idx = []
for i, li in enumerate(summary_str):
    if li.startswith('dummy'):
        dummy_idx.append(i)
        dummy_idx.append(i + 1)
# Print summary avoiding dummy indexes
for i, li in enumerate(summary_str):
    if i not in dummy_idx:
        print(li)
It is not very pretty, but it works. Using the string format:
==========================
          one      two
          (0)      (1)
--------------------------
Intercept 0.029    -0.000
          (0.065)  (0.068)
x1        0.023    0.025
          (0.022)  (0.022)
x2        -0.014   -0.014
          (0.022)  (0.022)
xsqr               0.024
                   (0.016)
N         2000     2000
R2        0.00     0.00
==========================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01
Using the LaTeX format:
\begin{table}
\caption{}
\begin{center}
\begin{tabular}{lcc}
\hline
& one & two \\
& (0) & (1) \\
\hline
\hline
\end{tabular}
\begin{tabular}{lll}
Intercept & 0.070 & 0.067 \\
& (0.069) & (0.071) \\
x1 & 0.001 & 0.001 \\
& (0.022) & (0.022) \\
x2 & -0.024 & -0.025 \\
& (0.022) & (0.022) \\
xsqr & & 0.003 \\
& & (0.015) \\
N & 2000 & 2000 \\
R2 & 0.01 & 0.01 \\
\hline
\end{tabular}
\end{center}
\end{table}
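The line filter above can also be wrapped in a small reusable function. This is only a sketch: `drop_summary_rows` is a hypothetical helper name, the sample rows are made up for illustration, and it assumes that every coefficient row is followed by exactly one standard-error row, as in the output shown here.

```python
def drop_summary_rows(lines, prefix, rows_per_term=2):
    """Return `lines` without the rows whose label starts with `prefix`.

    Hypothetical helper. Assumes each coefficient row is followed by
    `rows_per_term - 1` continuation rows (here: one standard-error row).
    """
    kept = []
    skip = 0
    for line in lines:
        if line.startswith(prefix):
            skip = rows_per_term  # drop this row and its continuation rows
        if skip:
            skip -= 1
            continue
        kept.append(line)
    return kept


# Made-up rows in the layout of the string summary (values are placeholders).
sample = [
    "Intercept  0.029    -0.000",
    "           (0.065)  (0.068)",
    "dummy[T.b] 0.000    0.000",
    "           (0.000)  (0.000)",
    "x1         0.023    0.025",
    "           (0.022)  (0.022)",
]
filtered = drop_summary_rows(sample, "dummy")
print("\n".join(filtered))
```

With the real table, the input would be `str(summary).split('\n')` or `summary.as_latex().split('\n')` from the snippet above.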
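I would use the regressor_order parameter of summary_col. It lets you specify which regressors are shown first, and if you also pass drop_omitted=True, the regressors you did not list are left out of the table entirely.
Example: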
# Collect every regressor name that appears in either model.
all_regressors = sorted(set(res.model.exog_names) | set(res2.model.exog_names))
# Drop the dummies using some logic on their names. With the formula above they
# are labeled 'dummy[T.b]', ...; with C(dummy) in the formula they start with 'C('.
all_regressors_no_fe = [var_name for var_name in all_regressors
                        if not var_name.startswith(('dummy', 'C('))]
print(summary_col([res, res2], stars=True, float_format='%0.3f',
                  model_names=['one\n(0)', 'two\n(1)'],
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
                             'R2': lambda x: "{:.2f}".format(x.rsquared)},
                  regressor_order=all_regressors_no_fe,
                  drop_omitted=True))
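The name filtering can be checked on its own, without fitting anything. The sketch below uses a hypothetical helper, `regressors_without`, and hard-coded name lists that assume the labels patsy gives the terms of the formulas in the question (`Intercept`, `dummy[T.b]`, and so on).

```python
def regressors_without(name_lists, prefix):
    """Sorted union of regressor names across models, minus those starting with `prefix`.

    Hypothetical helper, not part of statsmodels.
    """
    names = set()
    for model_names in name_lists:
        names.update(model_names)
    return sorted(n for n in names if not n.startswith(prefix))


# Assumed labels for 'y ~ x1 + x2 + dummy' and 'y ~ x1 + x2 + xsqr + dummy'
# (only two dummy levels listed, for brevity).
names_one = ["Intercept", "dummy[T.b]", "dummy[T.c]", "x1", "x2"]
names_two = ["Intercept", "dummy[T.b]", "dummy[T.c]", "x1", "x2", "xsqr"]

order = regressors_without([names_one, names_two], "dummy")
print(order)  # ['Intercept', 'x1', 'x2', 'xsqr']
```

With fitted models, the name lists would come from `res.model.exog_names` and `res2.model.exog_names`, and the result would be passed as `regressor_order` together with `drop_omitted=True`.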