Creating dummy variables using pandas or statsmodels for the interaction of two columns

Question

I have a data frame like this:

Index ID  Industry  years_spend       asset
6646  892         4            4  144.977037
2347  315        10            8  137.749138
7342  985         1            5  104.310217
137    18         5            5  156.593396
2840  381        11            2  229.538828
6579  883        11            1  171.380125
1776  235         4            7  217.734377
2691  361         1            2  148.865341
815   110        15            4  233.309491
2932  393        17            5  187.281724

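I want to create dummy variables for the interaction Industry x years_spend, which gives len(df.Industry.value_counts()) * len(df.years_spend.value_counts()) variables. For example, d_11_4 = 1 for all rows with Industry == 11 and years_spend == 4, and d_11_4 = 0 otherwise. I can then use these variables in some regression work.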

I know I can use df.groupby(['Industry','years_spend']) to create the groups I want, and I know I can create such variables for a single column using the formula syntax in statsmodels:

import statsmodels.formula.api as smf

mod = smf.ols("income ~   C(Industry)", data=df).fit()

But when I try to do the same with two columns, I get an error: IndexError: tuple index out of range

How can I do this with pandas, or with some function in statsmodels?

Answer 1

You can do something like this. First create a calculated field that combines Industry and years_spend:

import pandas as pd

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1], 'years_spend': [4, 5, 8, 4, 4, 1]})
df['industry_years'] = df['Industry'].astype('str') + '_' + df['years_spend'].astype('str')  # this is the calculated field

This is what df looks like:

   Industry  years_spend industry_years
0         4            4            4_4
1         3            5            3_5
2        11            8           11_8
3         4            4            4_4
4         1            4            1_4
5         1            1            1_1

Now you can apply get_dummies:

df = pd.get_dummies(df, columns=['industry_years'])

This gives you what you want :)
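
As a rough sketch of how this could be adapted to the d_11_4 naming from the question: the names key, dummies, all_keys and full below are illustrative and not part of the original answer. It also shows that get_dummies only creates columns for combinations that actually occur, and one possible way (an assumption about what the asker wants) to get a column for every Industry x years_spend pair.

```python
import pandas as pd

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1],
                   'years_spend': [4, 5, 8, 4, 4, 1]})

# Combined key, e.g. "11_8" for Industry == 11 and years_spend == 8.
key = df['Industry'].astype(str) + '_' + df['years_spend'].astype(str)

# prefix='d' gives column names such as d_11_8; only observed pairs get a column.
dummies = pd.get_dummies(key, prefix='d', dtype=int)
print(dummies.columns.tolist())

# To get one column per possible pair, declare every category up front.
all_keys = [f"{i}_{y}"
            for i in sorted(df['Industry'].unique())
            for y in sorted(df['years_spend'].unique())]
full = pd.get_dummies(key.astype(pd.CategoricalDtype(categories=all_keys)),
                      prefix='d', dtype=int)
print(full.shape)
```

Columns for pairs that never occur are all zeros, so they add nothing to a regression; the smaller dummies frame is usually the one to join back with pd.concat([df, dummies], axis=1).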

Answer 2

Using patsy syntax, it's just:

import statsmodels.formula.api as smf

mod = smf.ols("income ~ C(Industry):C(years_spend)", data=df).fit()

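The : character means "interaction"; you can also generalize this to interactions of more than two terms (C(a):C(b):C(c)), to interactions between numerical and categorical values, and so on. You might find the patsy docs useful.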
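
To see which dummy columns such a formula produces without fitting a model, one option is to build the design matrix directly with patsy. This is a minimal sketch that reuses the small frame from Answer 1; the "0 +" term, which drops the intercept so that each pair of levels gets its own indicator column, is an addition for illustration and is not part of the formula above.

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'Industry': [4, 3, 11, 4, 1, 1],
                   'years_spend': [4, 5, 8, 4, 4, 1]})

# Right-hand side only; "0 +" removes the intercept (illustrative choice).
design = dmatrix("0 + C(Industry):C(years_spend)", data=df,
                 return_type="dataframe")

print(design.shape)             # rows x interaction columns
print(design.columns.tolist())  # names generated by patsy
```

With return_type="dataframe" the result is an ordinary pandas DataFrame, so it can be inspected or joined to df like any other frame.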
