Pandas Groupby Head by Percentage of Row Counts

I have a dataframe:

state    city             score
CA       San Francisco    80
CA       San Francisco    90
...
NC       Raleigh          44
NY       New York City    22

I want to do a groupby.head(), but instead of an integer value I want to select the top 80% of rows, sorted by score, for each state-city combination.

So if CA, San Francisco has 100 rows and NC, Raleigh has 20 rows, the final dataframe would have the top 80 rows by score for CA, San Francisco and the top 16 rows by score for NC, Raleigh.

So the final code might look something like:

df.sort_values('score', ascending=False).groupby(['state', 'city']).head(80%)

Thanks!

from io import StringIO
import pandas as pd

# sample data
s = """state,city,score
CA,San Francisco,80
CA,San Francisco,90
CA,San Francisco,30
CA,San Francisco,10
CA,San Francisco,70
CA,San Francisco,60
CA,San Francisco,50
CA,San Francisco,40
NC,Raleigh,44
NC,Raleigh,54
NC,Raleigh,64
NC,Raleigh,14
NY,New York City,22
NY,New York City,12
NY,New York City,32
NY,New York City,42
NY,New York City,52"""

df = pd.read_csv(StringIO(s))

sample = 0.8  # fraction of each group to keep (80%)
# sort by score, highest first, then group by state and city
g = df.sort_values('score', ascending=False).groupby(['state', 'city'])
# for each group, keep the first int(len(group) * sample) rows;
# int() truncates, so the row count is rounded down
new_df = pd.concat([data.head(int(len(data) * sample)) for _, data in g])

   state           city  score
1     CA  San Francisco     90
0     CA  San Francisco     80
4     CA  San Francisco     70
5     CA  San Francisco     60
6     CA  San Francisco     50
7     CA  San Francisco     40
10    NC        Raleigh     64
9     NC        Raleigh     54
8     NC        Raleigh     44
16    NY  New York City     52
15    NY  New York City     42
14    NY  New York City     32
12    NY  New York City     22
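
If there are many groups, the Python-level loop over groups can be avoided by comparing each row's position within its group against that group's quota. The following is a minimal sketch under the same assumptions as the sample data above (lowercase column names, row count rounded down); `top_fraction` is a hypothetical helper name, not a pandas function.

```python
from io import StringIO
import pandas as pd

# same sample data as above
csv_text = """state,city,score
CA,San Francisco,80
CA,San Francisco,90
CA,San Francisco,30
CA,San Francisco,10
CA,San Francisco,70
CA,San Francisco,60
CA,San Francisco,50
CA,San Francisco,40
NC,Raleigh,44
NC,Raleigh,54
NC,Raleigh,64
NC,Raleigh,14
NY,New York City,22
NY,New York City,12
NY,New York City,32
NY,New York City,42
NY,New York City,52"""

df = pd.read_csv(StringIO(csv_text))


def top_fraction(frame, keys, column, fraction):
    """Keep the top `fraction` of rows per group, ranked by `column` (hypothetical helper)."""
    keys = list(keys)
    # order rows by group keys, then by score from highest to lowest
    ordered = frame.sort_values(
        keys + [column], ascending=[True] * len(keys) + [False]
    )
    grouped = ordered.groupby(keys)
    # 0-based position of each row inside its group
    position = grouped.cumcount()
    # number of rows to keep per group, truncated toward zero like int()
    quota = (grouped[column].transform("size") * fraction).astype(int)
    return ordered[position < quota]


result = top_fraction(df, ["state", "city"], "score", 0.8)
print(result)
```

On this sample the helper should select the same rows as the loop-based version, because both sort by score in descending order and truncate `0.8 * len(group)` to an integer.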

Use nlargest and compute the number of rows to select for each group from its length, i.e. 0.8 * len(group):

# column names must match the dataframe's lowercase headers
res = (
    df.groupby(['state', 'city'], group_keys=False)
      .apply(lambda g: g.nlargest(int(0.8*len(g)), "score"))
)
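
Depending on the pandas version, `groupby.apply` may emit a deprecation warning about the grouping columns or leave them out of the result. As a hedged alternative, the same `nlargest` idea can be sketched without `apply` by iterating over the groups. The small dataframe and the `fraction` variable below are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# illustrative data: 5 rows for CA / San Francisco, 4 rows for NC / Raleigh
df = pd.DataFrame({
    "state": ["CA"] * 5 + ["NC"] * 4,
    "city": ["San Francisco"] * 5 + ["Raleigh"] * 4,
    "score": [80, 90, 30, 10, 70, 44, 54, 64, 14],
})

fraction = 0.8  # keep the top 80% of each group, rounded down

# each iterated group is a full dataframe, grouping columns included
parts = [
    group.nlargest(int(fraction * len(group)), "score")
    for _, group in df.groupby(["state", "city"])
]
res = pd.concat(parts)
print(res)
```

Here the CA group keeps 4 of its 5 rows and the NC group keeps 3 of its 4, and the `state` and `city` columns remain in the result.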