How to sample from a DataFrame based on the percentile of a column?
Given a dataset like this:
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
[out]:
key freq percent
0 ABC 100 0.328947
1 DEF 60 0.197368
2 GHI 50 0.164474
3 JKL 40 0.131579
4 MNO 13 0.042763
5 PQR 11 0.036184
6 STU 10 0.032895
7 VWX 10 0.032895
8 YZZ 3 0.009868
9 WHYQ 3 0.009868
10 HOWEE 2 0.006579
11 DUH 1 0.003289
12 HAHA 1 0.003289
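The goal is to
- select 1 example from the 50-100 percentile range of the frequency,
- select 2 examples from the 10-50 percentile range, and
- select 4 examples from below the 10th percentile.
In this case, a suitable answer would be to:
- pick 1 from ['ABC', 'DEF']
- pick 2 from ['GHI', 'JKL', 'MNO', 'PQR']
- pick 4 from ['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']
I've tried this: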
import random
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
bin_50_100 = []
bin_10_50 = []
bin_10 = []
# Share of the total frequency not yet consumed by the rows seen so far.
total_percent = 1.0
for idx, row in df.sort_values(by=['freq', 'key'], ascending=False).iterrows():
    if total_percent > 0.5:
        bin_50_100.append(row['key'])
    elif 0.1 < total_percent < 0.5:
        bin_10_50.append(row['key'])
    else:
        bin_10.append(row['key'])
    total_percent -= row['percent']
print(random.sample(bin_50_100, 1))
print(random.sample(bin_10_50, 2))
print(random.sample(bin_10, 4))
[out]:
['DEF']
['MNO', 'PQR']
['HOWEE', 'WHYQ', 'HAHA', 'DUH']
But is there a simpler way to solve this problem?
Let's try:
bins = [0, 0.1, 0.5, 1]
# The labels double as the number of rows to draw from each bin:
# 4 from (0, 0.1], 2 from (0.1, 0.5], 1 from (0.5, 1].
samples = [4, 2, 1]
df['sample'] = pd.cut(df.percent[::-1].cumsum(),   # accumulate percentage from the bottom up
                      bins=bins,                   # bin edges
                      labels=samples               # num samples
                     ).astype(int)
df.groupby('sample').apply(lambda x: x.sample(n=x['sample'].iloc[0]))
Output:
             key  freq   percent  sample
sample
1      0     ABC   100  0.328947       1
2      2     GHI    50  0.164474       2
       5     PQR    11  0.036184       2
4      7     VWX    10  0.032895       4
       6     STU    10  0.032895       4
       12   HAHA     1  0.003289       4
       10  HOWEE     2  0.006579       4
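For comparison, here is a minimal variant of the same `pd.cut` idea, sketched under a few assumptions rather than taken from the answer itself. It accumulates the integer `freq` column before dividing, so the top value is exactly 1.0 and cannot drift past the last bin edge through floating-point rounding. It also samples each group in an explicit loop instead of `groupby(...).apply(...)`, whose handling of the grouping column may differ between pandas versions. The names `cum_share`, `bin`, `n_per_bin` and `picked` are illustrative only.

```python
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)

# Cumulative share of the total frequency, accumulated from the least
# frequent row upward. Integer sums keep the last value at exactly 1.0.
cum_share = df['freq'][::-1].cumsum() / df['freq'].sum()

# Bin index per row: 0 for (0, 0.1], 1 for (0.1, 0.5], 2 for (0.5, 1].
# The assignment aligns on the index, so the reversed order does not matter.
df['bin'] = pd.cut(cum_share, bins=[0, 0.1, 0.5, 1], labels=False)

# Assumed mapping from bin index to the number of rows to draw.
n_per_bin = {0: 4, 1: 2, 2: 1}

# Sample each bin separately, then stitch the pieces back together.
picked = pd.concat([
    group.sample(n=n_per_bin[bin_id], random_state=0)
    for bin_id, group in df.groupby('bin')
])
print(picked)
```

Dropping `random_state=0` gives a different draw on each run; it is fixed here only to make the sketch reproducible.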
See if this helps.
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
s = list(1 - df['percent'].cumsum())
s.pop(-1)
s.insert(0,1.0)
df['cum_lag'] = s
print(df[df['cum_lag'] > 0.5]['key'])
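The snippet above only prints the keys of the top bin. A possible way to carry the same `cum_lag` idea through to all three bins and to the sampling step is sketched below. It assumes the rows are already sorted by descending `freq`, as in the question, and it builds `cum_lag` with `Series.shift` instead of list edits. The names `top`, `middle` and `bottom` are illustrative and not part of the original answer.

```python
import pandas as pd

rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / df['freq'].sum()

# Share of the total still remaining before each row is counted.
# shift(1) moves every value down one row; the first row starts at 1.0.
df['cum_lag'] = (1 - df['percent'].cumsum()).shift(1, fill_value=1.0)

# Split into the three bins from the question.
top = df[df['cum_lag'] > 0.5]
middle = df[(df['cum_lag'] > 0.1) & (df['cum_lag'] <= 0.5)]
bottom = df[df['cum_lag'] <= 0.1]

# Draw 1, 2 and 4 keys from the respective bins.
for label, part, n in [('50-100', top, 1), ('10-50', middle, 2), ('<10', bottom, 4)]:
    print(label, part['key'].sample(n=n).tolist())
```

Boolean masks keep every step visible, which may be easier to adapt than a single `pd.cut` call when the bin edges or sample sizes change.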