One Hot Encoding of uncommon feature levels in Python
I have a model with a categorical factor. I one-hot encode it using pandas.get_dummies.
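However, the categorical factor has many uncommon levels. If I use pandas.get_dummies to re-encode new data, the resulting columns may be 'off', because some levels will not appear in the new data.
I am considering doing the following: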
dummies_df = pd.get_dummies(list_of_all_possible_levels)
dummies_df[:] = 0
dummies_df.drop(dummies_df.index[1:], inplace=True)
# If there are 10 levels this becomes a 10x10 Dataframe. I only need
# one 'empty' row and drop everything after the first.
# Let's say the DataFrame looks like this:
df['categorical_factor', 'numeric_factor', 'other_numeric_factor']
# I want to do something where I flag the column of the feature as 1
# and append the one-row dummies_df to each row of df
for cat in df.categorical_factor:
    dummies_df[cat] = 1
df['numeric_factor', 'other_numeric_factor'] + dummies_df
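I just don't know whether I should loop over the rows like this, or whether there is a better 'cartesian product' type of answer. If this were R, I would do cbind(df, dummies_df), because R knows to recycle the values of dummies_df.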
Or maybe I should use pandas.get_dummies on the new data and add the missing levels as new columns, like this:
new_dat['missing_level_1'] = [0 for _ in new_dat.index]
new_dat['missing_level_2'] = [0 for _ in new_dat.index]
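A minimal sketch of this second idea, assuming the full list of levels is known in advance. The name all_levels and the sample values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical list of every level the model was trained on.
all_levels = ['level_1', 'level_2', 'level_3']

new_dat = pd.DataFrame({'levels': ['level_1', 'level_2', 'level_2'],
                        'A': [5, 6, 7], 'B': [8, 9, 7]})
new_dat = new_dat.drop('levels', axis=1).join(
    pd.get_dummies(new_dat['levels']).astype(int))

# Add an all-zero column for each level that did not occur in new_dat.
for level in all_levels:
    if level not in new_dat.columns:
        new_dat[level] = 0  # a scalar is broadcast to every row
```

Columns added this way appear in the order they are added, so the result may still need reordering to match the frame the model was trained on.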
Edit: example data
import pandas as pd

levels=['level_1', 'level_2', 'level_3']
A = [0,1,2]
B = [3,4,5]
df = pd.DataFrame({'levels': levels, 'A': A, 'B': B})
df = df.drop('levels', axis=1).join(pd.get_dummies(df.levels))
new_levels=['level_1', 'level_2', 'level_2']
new_A = [5,6,7]
new_B = [8,9,7]
new_df = pd.DataFrame({'levels': new_levels, 'A': new_A, 'B': new_B})
new_df = new_df.drop('levels', axis=1).join(pd.get_dummies(new_df.levels))
df is now
+---------+---+---+---------+---------+---------+
| (index) | A | B | level_1 | level_2 | level_3 |
+---------+---+---+---------+---------+---------+
| 0 | 0 | 3 | 1 | 0 | 0 |
| 1 | 1 | 4 | 0 | 1 | 0 |
| 2 | 2 | 5 | 0 | 0 | 1 |
+---------+---+---+---------+---------+---------+
and new_df is now
+---------+---+---+---------+---------+
| (index) | A | B | level_1 | level_2 |
+---------+---+---+---------+---------+
| 0 | 5 | 8 | 1 | 0 |
| 1 | 6 | 9 | 0 | 1 |
| 2 | 7 | 7 | 0 | 1 |
+---------+---+---+---------+---------+
(The level_3 column is missing.)
I would like new_df to be
+---------+---+---+---------+---------+---------+
| (index) | A | B | level_1 | level_2 | level_3 |
+---------+---+---+---------+---------+---------+
| 0 | 5 | 8 | 1 | 0 | 0 |
| 1 | 6 | 9 | 0 | 1 | 0 |
| 2 | 7 | 7 | 0 | 1 | 0 |
+---------+---+---+---------+---------+---------+
The most robust solution is to reindex the dummies dataframe.
When you encode the first (prototype) dataframe, you remember the list of dummy columns:
import pandas as pd

# the initial encoding
levels=['level_1', 'level_2', 'level_3']
df_original = pd.DataFrame({'levels': levels, 'A': [0,1,2], 'B': [3,4,5]})
dummies = pd.get_dummies(df_original.levels)
df = df_original.drop('levels', axis=1).join(dummies)
# remember the levels and their order
dummy_columns = list(dummies.columns)
After that, you force the new dummies dataframe to have the same columns:
# encoding another dataframe
new_levels=['level_1', 'level_2', 'level_2']
new_df_original = pd.DataFrame({'levels': new_levels, 'A': [5,6,7], 'B': [8,9,7]})
# this is where I use the remembered information
new_dummies = pd.get_dummies(new_df_original.levels). \
    reindex(columns=dummy_columns).fillna(0).astype(int)
new_df = new_df_original.drop('levels', axis=1).join(new_dummies)
print(new_df)
It gives the result you want:
   A  B  level_1  level_2  level_3
0  5  8        1        0        0
1  6  9        0        1        0
2  7  7        0        1        0
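The same steps could be wrapped in a small helper so that every new dataframe is encoded the same way. This is only a sketch: encode_like is a hypothetical name, not a pandas function, and it passes fill_value=0 to reindex instead of calling fillna(0) afterwards.

```python
import pandas as pd


def encode_like(frame, column, dummy_columns):
    """One-hot encode `column`, forcing the dummy columns to match `dummy_columns`."""
    dummies = pd.get_dummies(frame[column])
    # fill_value=0 fills the levels that are absent from this frame.
    dummies = dummies.reindex(columns=dummy_columns, fill_value=0).astype(int)
    return frame.drop(column, axis=1).join(dummies)


# The column list remembered from the initial encoding.
dummy_columns = ['level_1', 'level_2', 'level_3']

new_df_original = pd.DataFrame({'levels': ['level_1', 'level_2', 'level_2'],
                                'A': [5, 6, 7], 'B': [8, 9, 7]})
new_df = encode_like(new_df_original, 'levels', dummy_columns)
print(new_df)
```

Note that reindex also drops any level that appears in the new data but is not listed in dummy_columns, so previously unseen levels are silently ignored.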