Create dummies from a column for a subset of data that doesn't contain all of the column's category values

Question

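I am working with a subset of a large data set.

The DataFrame has a column named "type". The values of "type" are expected to be one of [1, 2, 3, 4].

In a certain subset, I find that the "type" column contains only some of those values, e.g. [1, 4], like this: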

 In [1]: df
 Out[1]:
    type
 0     1
 1     4

When I create dummies from the "type" column of that subset, I get this:

In [2]: import pandas as pd
In [3]: pd.get_dummies(df["type"], prefix="type")
Out[3]:
   type_1  type_4
0       1       0
1       0       1

It does not have the columns named "type_2" and "type_3". What I want is:

   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1

Is there a way to do this?

Answer 1

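What you need to do is turn the 'type' column into a pd.Categorical and specify its categories:

# dtype=int keeps the 0/1 output shown below; recent pandas versions return booleans by default
pd.get_dummies(pd.Categorical(df.type, categories=[1, 2, 3, 4]), prefix='type', dtype=int)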

   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
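
As a self-contained sketch of this approach, the snippet below rebuilds the two-row example and then checks what happens to a value that is not among the declared categories. The `levels` name and the extra value 7 are illustrative assumptions, not part of the question. Such a value becomes NaN in the Categorical and ends up as an all-zero row, which may or may not be what you want.

```python
import pandas as pd

# The full set of categories the column is supposed to take (assumed known up front).
levels = [1, 2, 3, 4]

# The subset from the question: only 1 and 4 are present.
df = pd.DataFrame({"type": [1, 4]})

# Declaring the categories makes get_dummies emit one column per level,
# whether or not that level occurs in this subset.
cat = pd.Categorical(df["type"], categories=levels)
dummies = pd.get_dummies(cat, prefix="type", dtype=int)
print(dummies)

# A value outside `levels` (7 here, purely illustrative) turns into NaN,
# so its row contains only zeros.
unexpected = pd.Categorical([1, 4, 7], categories=levels)
dummies_unexpected = pd.get_dummies(unexpected, prefix="type", dtype=int)
print(dummies_unexpected)
```

If silent all-zero rows are a concern, checking `unexpected.isna().any()` before encoding is one way to catch them.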

Answer 2

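Another solution, with reindex and add_prefix:

# reindex_axis was removed in pandas 1.0; reindex(columns=...) does the same job.
# dtype=int keeps the 0/1 output shown below on recent pandas versions.
df1 = (pd.get_dummies(df["type"], dtype=int)
         .reindex(columns=[1, 2, 3, 4], fill_value=0)
         .add_prefix('type'))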
print (df1)
   type1  type2  type3  type4
0      1      0      0      0
1      0      0      0      1

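Or the categorical solution:

# astype('category', categories=...) is no longer supported; pass a CategoricalDtype instead
cat_type = pd.CategoricalDtype(categories=[1, 2, 3, 4])
df1 = pd.get_dummies(df["type"].astype(cat_type), prefix='type', dtype=int)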
print (df1)
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
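
The reindex idea can be wrapped in a small helper so the same list of levels is applied to every subset. This is only a sketch: the function name `dummies_with_all_levels` and its parameters are hypothetical, not a pandas API.

```python
import pandas as pd


def dummies_with_all_levels(series, levels, prefix):
    """One-hot encode `series`, forcing exactly one column per entry in `levels`.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Encode only the values that actually occur in this subset.
    dummies = pd.get_dummies(series, dtype=int)
    # Add the missing levels as all-zero columns and fix the column order.
    # Note: values that are not listed in `levels` are dropped by reindex.
    dummies = dummies.reindex(columns=levels, fill_value=0)
    # Turn the column labels 1, 2, ... into "type_1", "type_2", ...
    return dummies.add_prefix(f"{prefix}_")


subset = pd.Series([1, 4], name="type")
result = dummies_with_all_levels(subset, [1, 2, 3, 4], "type")
print(result)
```

Unlike the Categorical route, this version never sees the full list of levels until after encoding, so it also works when `get_dummies` has already been called elsewhere and only the resulting frame is available.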

Answer 3

Since you tagged your post as one-hot-encoding, you may find the OneHotEncoder from the sklearn module useful, in addition to the pure Pandas solutions:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5

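# one-hot encoding
# `categories` replaces the removed `n_values` argument; `sparse_output` is the
# name used since scikit-learn 1.2 (older releases call it `sparse`)
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)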
data = encoder.fit_transform(df.type.values.reshape(-1,1))

# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])

print(newdf)

   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1

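One advantage of this approach is that OneHotEncoder can easily produce sparse vectors for very large class sets. (Just switch the OneHotEncoder() constructor to sparse output: sparse=True in older scikit-learn releases, sparse_output=True from version 1.2 on.)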
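
A possible variant, sketched under the assumption that the valid levels are exactly [1, 2, 3, 4]: passing them through the `categories` argument avoids the extra type_0 column. The sketch relies on OneHotEncoder's default of returning a SciPy sparse matrix, so it does not need to name the `sparse`/`sparse_output` argument at all. The `levels` and `columns` names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question: only levels 1 and 4 occur.
df = pd.DataFrame({"type": [1, 4]})

# Assumed full set of valid levels, one list per encoded feature.
levels = [1, 2, 3, 4]

# By default the encoder returns a SciPy sparse matrix.
encoder = OneHotEncoder(categories=[levels], dtype=int)
encoded = encoder.fit_transform(df[["type"]].to_numpy())

# Build the column names from the categories the encoder actually used.
columns = [f"type_{level}" for level in encoder.categories_[0]]

# Densify only for display; keep `encoded` sparse for large class sets.
newdf = pd.DataFrame(encoded.toarray(), columns=columns)
print(newdf)
```

With the default settings, a value outside `levels` raises an error at fit time rather than producing an all-zero row.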

python-3.x

pandas

one-hot-encoding