Python Pandas replace new values in column with 'other'

Question

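I have a pandas DataFrame with a factor column that has 30 distinct levels. Some levels occur rarely, so I convert them to an 'Other' group. The resulting column has 25 distinct levels plus one 'Other' level.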

d = df1['column1'].value_counts() >= 50
df1['column1'] = [i if d[i] else 'Other' for i in df1['column1']]
df1['column1'] = df1['column1'].astype('category')

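I have a second DataFrame that I want to convert so it has the same levels as the first one (any new level that does not appear in the first DataFrame should also become 'Other'). I tried the code below, but I get a KeyError, and the message does not really explain the problem.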

df2['column1'] = [i if d[i] else 'Other' for i in df2['column1']]
df2['column1'] = df2['column1'].astype('category')

Any idea what is causing this?

Answer 1

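I was able to reproduce your KeyError with your code by injecting values into df2['column1'] that do not exist in df1['column1']. The lookup d[i] fails for any label that is missing from the index of d, and d only contains the levels seen in df1.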
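
To make the failure concrete, here is a minimal sketch on toy data (the values and the threshold of 3 are made up for illustration, not taken from the question). It reproduces the KeyError and shows one small change to the original list comprehension, using Series.get with a default, that sends unseen levels to 'Other':

```python
import pandas as pd

# Hypothetical toy data: 'c' is rare in df1, 'z' never appears in df1.
df1 = pd.DataFrame({"column1": ["a"] * 3 + ["b"] * 3 + ["c"]})
df2 = pd.DataFrame({"column1": ["a", "c", "z"]})

# Boolean Series indexed by the levels that occur in df1.
d = df1["column1"].value_counts() >= 3

# d["z"] raises KeyError because "z" is not in d's index.
try:
    [i if d[i] else "Other" for i in df2["column1"]]
    raised = False
except KeyError:
    raised = True

# Series.get returns the default instead of raising,
# so levels unseen in df1 fall into 'Other' as well.
fixed = [i if d.get(i, False) else "Other" for i in df2["column1"]]

print(raised)  # True
print(fixed)   # ['a', 'Other', 'Other']
```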

You can make the process robust to unseen values as follows. First, build df1:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(10, size=100)]})

df2 contains additional values (L10 and L11) that cannot occur in df1:

df2 = pd.DataFrame({'column1': [f'L{x}' for x in np.random.randint(12, size=100)]})

Get the most frequent levels from df1 and map everything else to 'other':

cat_counts = df1['column1'].value_counts()

df1.assign(column1=np.where(df1['column1'].isin(cat_counts[cat_counts > 10].index), df1['column1'], 'other')).astype({'column1': 'category'})

   column1
0       L4
1       L9
2       L9
3    other
4    other
..     ...
95   other
96   other
97   other
98      L3
99   other

The same construct also works for df2, even though it contains values that are not present in df1:

df2.assign(column1=np.where(df2['column1'].isin(cat_counts[cat_counts > 10].index), df2['column1'], 'other')).astype({'column1': 'category'})

   column1
0    other
1       L9
2    other
3    other
4    other
..     ...
95   other
96   other
97   other
98      L9
99   other
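
One caveat: astype('category') infers the categories separately for each frame, so df1 and df2 can end up with different category sets if a kept level happens to be absent from one of them. If both columns must share exactly the same levels, one option is an explicit pd.CategoricalDtype. The sketch below assumes toy data and a hypothetical helper named lump_rare; neither comes from the question:

```python
import pandas as pd

def lump_rare(series, keep, other="other"):
    """Replace values not in `keep` with `other`; return a categorical Series.

    Hypothetical helper for illustration. Assumes `other` is not in `keep`.
    """
    dtype = pd.CategoricalDtype(categories=list(keep) + [other])
    lumped = series.where(series.isin(keep), other)
    return lumped.astype(dtype)

df1 = pd.DataFrame({"column1": ["L1"] * 5 + ["L2"] * 4 + ["L3"]})
df2 = pd.DataFrame({"column1": ["L1", "L3", "L7"]})

# Levels to keep are always derived from df1 only.
counts = df1["column1"].value_counts()
keep = counts[counts > 3].index  # threshold chosen only for this toy data

df1["column1"] = lump_rare(df1["column1"], keep)
df2["column1"] = lump_rare(df2["column1"], keep)

print(list(df2["column1"]))                        # ['L1', 'other', 'other']
print(df1["column1"].dtype == df2["column1"].dtype)  # True
```

Here df2 never contains L2, yet L2 is still one of its categories, which keeps downstream steps such as one-hot encoding aligned between the two frames.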

Another option is to select the n most frequent levels instead of using a count threshold:

df1['column1'].isin(cat_counts.nlargest(5).index)
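
The line above only produces a boolean mask. A minimal sketch of applying it, assuming toy data and n = 2 (both chosen for illustration), could look like this, with Series.where keeping the values where the mask is True and replacing the rest:

```python
import pandas as pd

df1 = pd.DataFrame({"column1": ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]})
df2 = pd.DataFrame({"column1": ["a", "b", "c", "x"]})

cat_counts = df1["column1"].value_counts()
top = cat_counts.nlargest(2).index  # the two most frequent levels in df1

# Keep values found in `top`; everything else, including values
# never seen in df1, becomes 'other'.
df2_lumped = df2["column1"].where(df2["column1"].isin(top), "other")

print(df2_lumped.tolist())  # ['a', 'b', 'other', 'other']
```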
