pandas category: keep only most common ones and replace rest with NaN

Question

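I have a pd.Series with dtype=category that contains 253 unique values. Some of them occur very often, while others occur only once or twice. Now I want to keep only the top 10 and replace the rest with np.nan.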

I have already selected the categories I want to keep with top = df['cats'].value_counts().head(10). But what now?

Something like df['cats'].apply(cat_replace, args=top), with:

def cat_replace(c, top):
    if c in top:
        return c
    else:
        return np.nan

However, this doesn't look very 'pandas' to me, and I feel there is a better way. Any better suggestions?

Answer 1

Borrowing from elsewhere, you could consider doing something like:

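# Take .index: iterating the value_counts() result yields the counts, not the labels.
top = set(df['cats'].value_counts().head(10).index)
# Boolean result: True where the value is one of the top categories.
df['cats'].apply(top.__contains__)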
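To get from that membership test to the replacement the question asks for, one option is to build the mask with isin and pass it to Series.where, which keeps the values where the mask is True and puts NaN everywhere else. A minimal sketch, assuming a small made-up cats column; the helper name keep_top is hypothetical:

```python
import pandas as pd

def keep_top(s: pd.Series, n: int) -> pd.Series:
    """Return a copy of s in which values outside the n most frequent are NaN."""
    # value_counts() sorts by frequency; the labels live in its index.
    top = s.value_counts().head(n).index.tolist()
    # where() keeps entries where the mask is True and inserts NaN elsewhere.
    return s.where(s.isin(top))

# Hypothetical sample data: 'a' and 'b' are frequent, 'c' and 'd' are rare.
df = pd.DataFrame({'cats': pd.Categorical(list('aaabbbcd'))})
result = keep_top(df['cats'], 2)
```

Since where returns a new Series, the original column is left untouched unless the result is assigned back with df['cats'] = result.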

Answer 2

import numpy as np
import pandas as pd

# Sample data.
df = pd.DataFrame(
    {'cats': pd.Categorical(
        list('abcdefghij') * 5
        + list('klmnopqrstuvwxyz'))}
)

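top_n = 10
# The index of value_counts() holds the labels, ordered from most to least frequent.
top_cats = df['cats'].value_counts().head(top_n).index.tolist()
# Overwrite every row whose label is not among the top_n with NaN.
df.loc[~df['cats'].isin(top_cats), 'cats'] = np.nan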
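One thing to keep in mind with the assignment above: the rare labels stay registered as categories of the column even though no row uses them any more. If that matters, Series.cat.set_categories can handle both steps at once, because values that are not in the new category list become NaN. A sketch on smaller made-up data, not a drop-in replacement:

```python
import pandas as pd

# Hypothetical sample data: 'a', 'b', 'c' occur 4 times each; 'x', 'y', 'z' once.
df = pd.DataFrame({'cats': pd.Categorical(list('abc') * 4 + list('xyz'))})

top_n = 3
top_cats = df['cats'].value_counts().head(top_n).index.tolist()

# Values whose category is not in top_cats become NaN, and the
# category list itself shrinks to top_cats.
df['cats'] = df['cats'].cat.set_categories(top_cats)
```

After this, df['cats'].cat.categories contains only the kept labels, which can matter for later groupby or value_counts calls that report every category.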

python

pandas

categorical-data