Interactions between dummy variables in Python
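I am trying to understand how to address columns after using get_dummies.
For example, let's say I have three categorical variables.
The first variable has 2 levels.
The second variable has 5 levels.
The third variable has 2 levels.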
df=pd.DataFrame({"a":["Yes","Yes","No","No","No","Yes","Yes"], "b":["a","b","c","d","e","a","c"],"c":["1","2","2","1","2","1","1"]})
I created dummies for all three variables in order to use them in an sklearn regression in Python.
df1 = pd.get_dummies(df,drop_first=True)
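Now I want to create two interactions (multiplications): bc and ba.
How can I multiply each dummy variable by the dummies of another variable without writing out their specific names, as I do here: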
df1['a_yes_b'] = df1['a_Yes']*df1['b_b']
df1['a_yes_c'] = df1['a_Yes']*df1['b_c']
df1['a_yes_d'] = df1['a_Yes']*df1['b_d']
df1['a_yes_e'] = df1['a_Yes']*df1['b_e']
df1['c_2_b'] = df1['c_2']*df1['b_b']
df1['c_2_c'] = df1['c_2']*df1['b_c']
df1['c_2_d'] = df1['c_2']*df1['b_d']
df1['c_2_e'] = df1['c_2']*df1['b_e']
Thanks.
You can use loops to create the new columns. To filter the column names, use boolean indexing with str.startswith:
a = df1.columns[df1.columns.str.startswith('a')]
b = df1.columns[df1.columns.str.startswith('b')]
c = df1.columns[df1.columns.str.startswith('c')]
for col1 in b:
    for col2 in a:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])
for col1 in b:
    for col2 in c:
        df1[col2 + '_' + col1.split('_')[1]] = df1[col1].mul(df1[col2])
print (df1)
   a_Yes  b_b  b_c  b_d  b_e  c_2  a_Yes_b  a_Yes_c  a_Yes_d  a_Yes_e  c_2_b  \
0      1    0    0    0    0    0        0        0        0        0      0
1      1    1    0    0    0    1        1        0        0        0      1
2      0    0    1    0    0    1        0        0        0        0      0
3      0    0    0    1    0    0        0        0        0        0      0
4      0    0    0    0    1    1        0        0        0        0      0
5      1    0    0    0    0    0        0        0        0        0      0
6      1    0    1    0    0    0        0        1        0        0      0
   c_2_c  c_2_d  c_2_e
0      0      0      0
1      0      0      0
2      1      0      0
3      0      0      0
4      0      0      1
5      0      0      0
6      0      0      0
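The loops above can also be wrapped in a small helper so the same logic serves any pair of variables. The following is only a sketch: the function name interaction_columns and its parameters are hypothetical, dtype=int is passed to get_dummies on the assumption that 0/1 integers are wanted (recent pandas versions return booleans by default), and the prefix match includes the underscore so that a variable named "a" does not also pick up a column such as "age_30".

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes"],
    "b": ["a", "b", "c", "d", "e", "a", "c"],
    "c": ["1", "2", "2", "1", "2", "1", "1"],
})
# Assumption: integer dummies are wanted, so dtype=int is requested explicitly.
df1 = pd.get_dummies(df, drop_first=True, dtype=int)


def interaction_columns(frame, left, right):
    """Hypothetical helper: products of every `left` dummy with every `right` dummy.

    `left` and `right` are the original variable names, i.e. the prefixes
    that get_dummies places before the underscore.
    """
    left_cols = [col for col in frame.columns if col.startswith(left + "_")]
    right_cols = [col for col in frame.columns if col.startswith(right + "_")]
    products = {}
    for lcol in left_cols:
        for rcol in right_cols:
            level = rcol[len(right) + 1:]  # the part after "<right>_"
            products[lcol + "_" + level] = frame[lcol] * frame[rcol]
    return pd.DataFrame(products, index=frame.index)


# Both interaction blocks are computed from the untouched df1, then appended once.
result = pd.concat(
    [df1, interaction_columns(df1, "a", "b"), interaction_columns(df1, "c", "b")],
    axis=1,
)
print(result.columns.tolist())
```

Computing both blocks from the original df1 before concatenating avoids the newly created columns (for example a_Yes_b) being matched by a later prefix filter.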
But if a and c each have only one dummy column (true in the sample, possibly also in the real data), use filter, mul, squeeze and concat:
a = df1.filter(regex='^a')
b = df1.filter(regex='^b')
c = df1.filter(regex='^c')
dfa = b.mul(a.squeeze(), axis=0).rename(columns=lambda x: a.columns[0] + x[1:])
dfc = b.mul(c.squeeze(), axis=0).rename(columns=lambda x: c.columns[0] + x[1:])
df1 = pd.concat([df1, dfa, dfc], axis=1)
print (df1)
   a_Yes  b_b  b_c  b_d  b_e  c_2  a_Yes_b  a_Yes_c  a_Yes_d  a_Yes_e  c_2_b  \
0      1    0    0    0    0    0        0        0        0        0      0
1      1    1    0    0    0    1        1        0        0        0      1
2      0    0    1    0    0    1        0        0        0        0      0
3      0    0    0    1    0    0        0        0        0        0      0
4      0    0    0    0    1    1        0        0        0        0      0
5      1    0    0    0    0    0        0        0        0        0      0
6      1    0    1    0    0    0        0        1        0        0      0
   c_2_c  c_2_d  c_2_e
0      0      0      0
1      0      0      0
2      1      0      0
3      0      0      0
4      0      0      1
5      0      0      0
6      0      0      0
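To see why squeeze is needed in this variant, here is a tiny sketch with made-up three-row frames (the values are illustrative, not the question's data). Multiplying two DataFrames aligns on column labels, so frames with different column names yield only NaN. Squeezing the one-column frame into a Series and passing axis=0 multiplies every column row by row instead.

```python
import pandas as pd

# Illustrative data only: two "b" dummies and a single "a" dummy.
b = pd.DataFrame({"b_b": [0, 1, 0], "b_c": [1, 0, 1]})
a = pd.DataFrame({"a_Yes": [1, 1, 0]})

# DataFrame * DataFrame aligns on column labels; none match, so all values are NaN.
misaligned = b.mul(a)

# squeeze() turns the one-column DataFrame into a Series named "a_Yes".
a_series = a.squeeze()

# axis=0 matches the Series against the row index, so each column of b
# is multiplied element-wise by a_Yes.
product = b.mul(a_series, axis=0)
print(product)
```

Note that squeeze only returns a Series when the frame has exactly one column and more than one row, which is why this approach depends on a and c being single-column.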
You can convert the DataFrame columns to NumPy arrays and then multiply them accordingly. Here is a link where you can find ways to do this.
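As a rough illustration of the NumPy route (this is a sketch, not the content of the linked page, and the variable names are chosen here for illustration), the b dummies can be taken as a 2-D array and multiplied by the a_Yes column reshaped to a column vector, relying on broadcasting. dtype=int is again an assumption so that the arrays hold 0/1 integers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes"],
    "b": ["a", "b", "c", "d", "e", "a", "c"],
    "c": ["1", "2", "2", "1", "2", "1", "1"],
})
df1 = pd.get_dummies(df, drop_first=True, dtype=int)

b_cols = [col for col in df1.columns if col.startswith("b_")]
b_values = df1[b_cols].to_numpy()            # shape (7, 4)
a_values = df1["a_Yes"].to_numpy()[:, None]  # shape (7, 1), broadcast across the 4 columns

products = b_values * a_values               # shape (7, 4)

# Rebuild a labelled DataFrame so the result can be concatenated back onto df1.
names = ["a_Yes_" + col.split("_", 1)[1] for col in b_cols]
interactions = pd.DataFrame(products, columns=names, index=df1.index)
print(interactions)
```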
This solves your problem:
import itertools

import pandas as pd


def get_design_with_pair_interaction(data, group_pair):
    """Get the design matrix with the pairwise interactions.

    Parameters
    ----------
    data (pandas.DataFrame):
        Pandas data frame with the two variables to build the design matrix of their two main effects and their interaction
    group_pair (iterator):
        List with the name of the two variables (name of the columns) to build the design matrix of their two main effects and their interaction

    Returns
    -------
    x_new (pandas.DataFrame):
        Pandas data frame with the design matrix of their two main effects and their interaction
    """
    # One dummy column per level of each variable (no level is dropped here).
    x = pd.get_dummies(data[group_pair])
    interactions_lst = list(
        itertools.combinations(
            x.columns.tolist(),
            2,
        ),
    )
    x_new = x.copy()
    for level_1, level_2 in interactions_lst:
        # Skip pairs of levels that belong to the same variable.
        if level_1.split('_')[0] == level_2.split('_')[0]:
            continue
        x_new = pd.concat(
            [
                x_new,
                x[level_1] * x[level_2]
            ],
            axis=1,
        )
        # The unnamed product column is labelled 0 by concat; give it a descriptive name.
        x_new = x_new.rename(
            columns = {
                0: (level_1 + '_' + level_2)
            }
        )
    return x_new
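For a sense of what this function returns, here is a condensed, self-contained variant applied to the question's a and c columns. It is a sketch rather than the answer's exact code: the name pair_interactions is hypothetical, dtype=int is an added assumption, and each product is named directly instead of being renamed after concat. Because get_dummies is called without drop_first, every level gets its own column, unlike the df1 in the question.

```python
import itertools

import pandas as pd


def pair_interactions(data, group_pair):
    """Hypothetical condensed variant: main-effect dummies plus cross-variable products."""
    x = pd.get_dummies(data[group_pair], dtype=int)
    pieces = [x]
    for level_1, level_2 in itertools.combinations(x.columns, 2):
        # Levels of the same variable share the text before the first underscore.
        if level_1.split("_")[0] == level_2.split("_")[0]:
            continue
        pieces.append((x[level_1] * x[level_2]).rename(level_1 + "_" + level_2))
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "a": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes"],
    "b": ["a", "b", "c", "d", "e", "a", "c"],
    "c": ["1", "2", "2", "1", "2", "1", "1"],
})
design = pair_interactions(df, ["a", "c"])
print(design.columns.tolist())
```

One caveat applies to both versions: the same-variable check splits on the first underscore, so it assumes the original column names contain no underscores themselves.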