How to sample from DataFrame based on percentile of a column?

Given a dataset like this:

import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60}, 
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40}, 
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11}, 
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10}, 
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3}, 
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1}, 
{'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows)

df['percent'] = df['freq'] / sum(df['freq'])

[out]:

      key  freq   percent
0     ABC   100  0.328947
1     DEF    60  0.197368
2     GHI    50  0.164474
3     JKL    40  0.131579
4     MNO    13  0.042763
5     PQR    11  0.036184
6     STU    10  0.032895
7     VWX    10  0.032895
8     YZZ     3  0.009868
9    WHYQ     3  0.009868
10  HOWEE     2  0.006579
11    DUH     1  0.003289
12   HAHA     1  0.003289

The goal is to

  1. select 1 example from the top 50-100 percentile of the frequency,
  2. select 2 examples from the 10-50 percentile, and
  3. select 4 examples from the < 10 percentile.

In this case, the answers that fit are:

  1. select 1 from ['ABC', 'DEF']
  2. select 2 from ['GHI', 'JKL', 'MNO', 'PQR']
  3. select 4 from ['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']

I've tried this:

import random
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60}, 
{'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40}, 
{'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11}, 
{'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10}, 
{'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3}, 
{'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1}, 
{'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])

bin_50_100 = []
bin_10_50 = []
bin_10 = []

total_percent = 1.0
for idx, row in df.sort_values(by=['freq', 'key'], ascending=False).iterrows():
    if total_percent > 0.5:
        bin_50_100.append(row['key'])
    elif total_percent > 0.1:  # also covers total_percent == 0.5 exactly
        bin_10_50.append(row['key'])
    else:
        bin_10.append(row['key'])
    total_percent -= row['percent']

print(random.sample(bin_50_100, 1))
print(random.sample(bin_10_50, 2))
print(random.sample(bin_10, 4))

[out]:

['DEF']
['MNO', 'PQR']
['HOWEE', 'WHYQ', 'HAHA', 'DUH']

But is there a simpler way to solve this problem?

Let's try:

bins = [0, 0.1, 0.5, 1]
samples = [4, 2, 1]   # rows to draw from each bin, lowest bin first

df['sample'] = pd.cut(df.percent[::-1].cumsum(),  # accumulate percentage from the bottom
                      bins=bins,                  # bin edges
                      labels=samples              # label each bin with its sample count
                     ).astype(int)

# the group key is the number of rows to draw from that group
df.groupby('sample').apply(lambda x: x.sample(n=int(x.name)))

Output (one possible draw, since the sampling is random):

             key  freq   percent  sample
sample                                  
1      0     ABC   100  0.328947       1
2      2     GHI    50  0.164474       2
       5     PQR    11  0.036184       2
4      7     VWX    10  0.032895       4
       6     STU    10  0.032895       4
       12   HAHA     1  0.003289       4
       10  HOWEE     2  0.006579       4
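
For reference, the steps above can be put together as a self-contained sketch. The helper name `sample_by_share`, its parameters, and the `random_state` seed are assumptions made for illustration, not part of the answer. It assumes the rows are already sorted by `freq` in descending order, and it clips the cumulative share at 1.0 in case floating-point rounding pushes the last value slightly past the top bin edge.

```python
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]


def sample_by_share(df, bins, counts, random_state=None):
    """Hypothetical helper: draw counts[i] rows from the i-th cumulative-share bin."""
    share = df['freq'] / df['freq'].sum()
    # Accumulate from the least frequent row upward; clip guards against
    # a final value that rounds to slightly more than 1.0.
    cum = share.iloc[::-1].cumsum().clip(upper=1.0)
    # labels=False returns the integer bin index (0 = lowest bin).
    bin_idx = pd.cut(cum, bins=bins, labels=False)
    # Grouping by a Series aligns on the index, so the reversed order is harmless.
    parts = [
        group.sample(n=counts[int(idx)], random_state=random_state)
        for idx, group in df.groupby(bin_idx)
    ]
    return pd.concat(parts)


df = pd.DataFrame(rows)
picked = sample_by_share(df, bins=[0, 0.1, 0.5, 1], counts=[4, 2, 1], random_state=0)
print(picked['key'].tolist())
```

Keeping the sample counts in a separate list, rather than using them as bin labels, lets two bins share the same count, which `pd.cut` would reject as duplicate labels.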

See if this helps.

df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])

s = list(1 - df['percent'].cumsum())
s.pop(-1)
s.insert(0,1.0)
df['cum_lag'] = s

print(df[df['cum_lag'] > 0.5]['key'])
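
The snippet above only prints the top bin. A possible way to carry the same `cum_lag` idea through to all three bins and the actual sampling is sketched below. The variable names `top`, `mid`, `low` and `picked` are illustrative, and `Series.shift(fill_value=1.0)` is used here in place of the manual pop/insert; it again assumes the rows are sorted by `freq` in descending order.

```python
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]

df = pd.DataFrame(rows)
df['percent'] = df['freq'] / df['freq'].sum()

# Share of the total still remaining before each row is counted;
# the first row starts at 1.0.
df['cum_lag'] = (1 - df['percent'].cumsum()).shift(fill_value=1.0)

top = df[df['cum_lag'] > 0.5]
mid = df[(df['cum_lag'] > 0.1) & (df['cum_lag'] <= 0.5)]
low = df[df['cum_lag'] <= 0.1]

# Draw 1, 2 and 4 rows from the three bins respectively.
picked = pd.concat([top.sample(n=1), mid.sample(n=2), low.sample(n=4)])
print(picked['key'].tolist())
```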