Pandas group by two columns and get top n rows of each value of one of the columns sorted in descending order

Question

I have a pandas DataFrame with many columns (the two columns of interest are named a and b).

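I want to:

1. group by a and b,
2. count the number of occurrences in each group,
3. sort the groups in descending order of occurrence,
4. for each value of b, take the top n values of a with the most occurrences.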

I can get as far as step 3 with the following code:

  a_b_count = df.groupby(['a', 'b']).size().reset_index().rename({0:'count'},axis='columns').sort_values('count', ascending = False)

But how do I then get, for each value of b, the top n values of a with the most occurrences?

Example

df =

     a           b       ...
     a1          b1      ...
     a2          b1      ...
     a1          b1      ...
     a1          b2      ...
     a2          b2      ...
     a2          b2      ...
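
For reproducibility, the sample frame above could be built roughly as follows. This is a minimal sketch that includes only columns a and b, since the remaining columns are elided in the example; the name `counts` is introduced here purely for illustration.

```python
import pandas as pd

# Only the two columns of interest; the real frame has more columns.
df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a1", "a2", "a2"],
    "b": ["b1", "b1", "b1", "b2", "b2", "b2"],
})

# Occurrences of each (a, b) pair, as a Series indexed by (a, b).
counts = df.groupby(["a", "b"]).size()
print(counts)
```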

Expected output (n = 1):

    a            b       count
    a1           b1        2
    a2           b2        2

Answer 1

You can use nlargest instead of sort. It should be faster when n is small relative to the size of the Series.

df.groupby(['a', 'b']).size().groupby(
    level=1).nlargest(n).reset_index(-1, drop=True)

b   a 
b1  a1    2
b2  a2    2
dtype: int64
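
As a self-contained sketch of the same idea, assuming the sample data from the question and n = 1 (the names `counts` and `top` are only illustrative), the result can also be turned back into a DataFrame with a count column:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a1", "a2", "a2"],
    "b": ["b1", "b1", "b1", "b2", "b2", "b2"],
})
n = 1

# Count each (a, b) pair; the result is a Series indexed by (a, b).
counts = df.groupby(["a", "b"]).size()

top = (
    counts.groupby(level=1)          # one group per value of b
    .nlargest(n)                     # largest n counts within each b, in descending order
    .reset_index(-1, drop=True)      # drop the duplicated b level added by the groupby
    .reset_index(name="count")       # back to a DataFrame with columns b, a, count
)
print(top)
```

Within each value of b, nlargest returns the counts already sorted in descending order, so no separate sort is needed.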

Answer 2

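Here is one way, using crosstab to get the frequencies of columns a and b. Note that nlargest(1, keep="all") picks the highest count overall (keeping ties) rather than the top count within each b; on the sample data both pairs tie at 2, so it matches the expected output: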

pd.crosstab(df.a, df.b).stack().nlargest(1, keep="all").reset_index(name="count")
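
If the top n is needed within each value of b rather than overall, the crosstab result could be combined with a sort and groupby(...).head(n). The following is a hedged sketch on the question's sample data, not part of the original answer; `freq`, `long_counts`, and `top` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a1", "a2", "a2"],
    "b": ["b1", "b1", "b1", "b2", "b2", "b2"],
})
n = 1

# Frequency table: rows are values of a, columns are values of b.
freq = pd.crosstab(df["a"], df["b"])

# Long format with columns b, a, count; crosstab fills missing pairs with 0,
# so drop combinations that never occur.
long_counts = freq.unstack().reset_index(name="count")
long_counts = long_counts[long_counts["count"] > 0]

top = (
    long_counts.sort_values(["b", "count"], ascending=[True, False])
    .groupby("b")
    .head(n)                      # first n rows of each b after sorting
    .reset_index(drop=True)
)
print(top)
```

Unlike nlargest(n, keep="all"), head(n) returns exactly n rows per group and breaks ties by row order.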
