Mapping pandas dataframe column to a dictionary

Question

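I have a dataframe containing a high-cardinality categorical variable (many unique values). I want to recode that variable to a small set of values (the most frequent ones) and replace every other value with a catch-all category ("other"). A simple example: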

These are the two values that should remain unchanged:

top_values = ['apple', 'orange']

I chose them based on their frequency in the following dataframe column:

{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'banana',
5: 'grape'}}
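For reference, one possible way to derive such a list from the column is sketched below. It assumes the frame is named df, and n_top is a hypothetical parameter for how many of the most frequent values to keep; neither name comes from the original post.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']}
)

# n_top is a hypothetical parameter: how many frequent values to keep
n_top = 2

# Count occurrences of each value, then keep the labels of the n_top largest counts
counts = df['fruits'].value_counts()
top_values = counts.nlargest(n_top).index.tolist()

print(top_values)
```

When several values share the same count, which of them end up in the list depends on their order in the counts, so a frequency threshold may be a better fit in that case.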

The column should be recoded as follows:

{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'other',
5: 'other'}}

How can I do this? (The dataframe has millions of records.)

Answer 1

There are at least a couple of ways to do this:

df['fruits'] = df['fruits'].where(df['fruits'].isin(top_values), 'other')

df.loc[~df['fruits'].isin(top_values), 'fruits'] = 'other'

After this step, you may want to convert the series to a categorical:

df['fruits'] = df['fruits'].astype('category')

Converting before the replacement is unlikely to help, because the input series has high cardinality.
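Putting the pieces together on the sample data from the question, a minimal end-to-end sketch might look like this. The variable name keep is only for illustration, and where is assigned back to the column rather than called with inplace=True on the selected series.

```python
import pandas as pd

# Sample data and the values to keep, as given in the question
df = pd.DataFrame(
    {'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']}
)
top_values = ['apple', 'orange']

# Boolean mask: True where the value is one of the top values
keep = df['fruits'].isin(top_values)

# Keep values where the mask is True, replace the rest with 'other'
df['fruits'] = df['fruits'].where(keep, 'other')

# Convert to categorical only after the recode, once cardinality is low
df['fruits'] = df['fruits'].astype('category')

print(df['fruits'].tolist())
# ['apple', 'apple', 'orange', 'orange', 'other', 'other']
```

Both isin and where are vectorized, so this should scale to millions of rows far better than a row-wise function.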

Answer 2

df['newCol'] = df['fruits'].apply(lambda v: v if v in top_values else 'other')
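Since the title asks about mapping to a dictionary, here is a hedged sketch of that variant: build an identity dictionary for the top values, map the column through it, and fill the unmapped entries. The name identity is hypothetical. Note that any real missing values in the column would also become 'other' with this approach.

```python
import pandas as pd

# Sample data and the values to keep, as given in the question
df = pd.DataFrame(
    {'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']}
)
top_values = ['apple', 'orange']

# Identity mapping for the values to keep; anything absent maps to NaN
identity = {v: v for v in top_values}

# Map through the dictionary, then fill the unmapped entries with 'other'
df['newCol'] = df['fruits'].map(identity).fillna('other')

print(df['newCol'].tolist())
# ['apple', 'apple', 'orange', 'orange', 'other', 'other']
```

Unlike apply with a lambda, map with a dictionary avoids calling a Python function per element, which tends to matter at this data size.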
