Pandas one-hot-encode columns to dummies, including an 'other' encoding

Question

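My end goal is to one-hot encode a Pandas column. In this case, I want to one-hot encode column "b" as follows: keep apple, banana, and orange, and encode any other fruit as "other".

Example: in the code below, "grapefruit" is rewritten as "other", and so would "kiwi" and "avocado" be if they appeared in my data.

The code below works: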

import pandas as pd

df = pd.DataFrame({
    "a": [1,2,3,4,5],
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
    "c": [True, False, True, False, True],
})
print(df)

def analyze_fruit(s):
    if s in ("apple", "banana", "orange"):
        return s
    else:
        return "other"

df['b'] = df['b'].apply(analyze_fruit)

df2 = pd.get_dummies(df['b'], prefix='b')
print(df2)

My question: is there a shorter way to do the analyze_fruit() business? I tried DataFrame.replace() with a negative lookahead assertion, but had no luck.

Answer 1

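You can convert the column to a Categorical before calling get_dummies, then use fillna. Any value that does not match the declared categories becomes NaN, which fillna can easily fill (note that 'other' must itself be one of the declared categories). Another benefit of a Categorical is that an ordering can also be defined by adding ordered=True: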

df['b'] = pd.Categorical(
    df['b'],
    categories=['apple', 'banana', 'orange', 'other']
).fillna('other')

df2 = pd.get_dummies(df['b'], prefix='b')
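
To illustrate the ordering point, here is a minimal sketch. The category order below ('other' first) is a hypothetical choice made only for the example, and dtype=int is passed so the dummies are 0/1 integers. As far as I can tell, the dummy columns follow the order of the categories list, while ordered=True additionally lets the values be compared and sorted in that order.

```python
import pandas as pd

df = pd.DataFrame({
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
})

# Hypothetical ordering, chosen only for illustration: 'other' comes first.
categories = ["other", "orange", "banana", "apple"]

# Values not listed in `categories` (here "grapefruit") become NaN.
cat = pd.Categorical(df["b"], categories=categories, ordered=True)

# Fill the NaN values with 'other', which is one of the declared categories.
df["b"] = pd.Series(cat, index=df.index).fillna("other")

# The dummy columns appear in the order of `categories`.
df2 = pd.get_dummies(df["b"], prefix="b", dtype=int)
print(list(df2.columns))
```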

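A standard replacement with something like np.where also works here, but dummies are typically used with categorical data, so being able to add an ordering, so that the dummy columns appear in a set order, can be helpful: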

import numpy as np

df['b'] = np.where(df['b'].isin(['apple', 'banana', 'orange']),
                   df['b'],
                   'other')

df2 = pd.get_dummies(df['b'], prefix='b')
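
If you would rather stay within pandas, Series.where should give the same result as the np.where version. This is a sketch of an alternative, not part of the approach above, and the name `keep` is just an illustrative variable:

```python
import pandas as pd

df = pd.DataFrame({
    "b": ["apple", "banana", "banana", "orange", "grapefruit"],
})

# Illustrative name: the fruits to keep as their own dummy columns.
keep = ["apple", "banana", "orange"]

# Series.where keeps values where the condition is True
# and replaces everything else with 'other'.
df["b"] = df["b"].where(df["b"].isin(keep), "other")

df2 = pd.get_dummies(df["b"], prefix="b")
print(df2)
```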

Both produce df2:

   b_apple  b_banana  b_orange  b_other
0        1         0         0        0
1        0         1         0        0
2        0         1         0        0
3        0         0         1        0
4        0         0         0        1
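
One caveat about the 0/1 output: on recent pandas versions (2.0 and later, as far as I'm aware) get_dummies returns boolean columns by default, so you may see True/False instead. Assuming you want integers, passing dtype=int is one way to get them:

```python
import pandas as pd

df = pd.DataFrame({
    "b": ["apple", "banana", "banana", "orange", "other"],
})

# dtype=int requests integer 0/1 dummies instead of the default dtype,
# which depends on the installed pandas version.
df2 = pd.get_dummies(df["b"], prefix="b", dtype=int)
print(df2)
```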
