How can I sample from a dataframe weighted by a groupby column?

Here is one way to sample. I tried:

sample = 2000
sample_df = df.groupby('prefix').sample(n=sample, random_state=1)

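It groups df by prefix and samples 2k items from each group. I have 9 groups. I would like to sample 18k in total, but weighted by the number of rows in each group.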

IIUC, this is one way:

sample = 2000
col_name = "prefix"

probs = df[col_name].map(df[col_name].value_counts())
sample_df = df.sample(n=sample, weights=probs)

probs holds the (unnormalized) weight for each value in the prefix column, and we sample according to those weights.
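To see what "unnormalized" means in practice, here is a minimal sketch on a hypothetical toy frame (the group labels "a", "b", "c" and the names row_p and group_p are made up for illustration). pandas rescales weights that do not sum to 1, so dividing by the total by hand shows the per-row probability of a single draw:

```python
import pandas as pd

# Hypothetical toy frame; "prefix" matches the column name used above.
df = pd.DataFrame({"prefix": ["a"] * 4 + ["b"] * 3 + ["c"] * 2})

# Per-row weight: the size of the group that row belongs to.
probs = df["prefix"].map(df["prefix"].value_counts())

# Normalizing by hand gives each row's probability for a single draw.
row_p = probs / probs.sum()

# Summing those probabilities per group gives each group's share of a draw.
group_p = row_p.groupby(df["prefix"]).sum()
print(group_p)
```

One thing worth checking against your goal: since every row in a group carries the group's count as its weight, a group's total share is proportional to count squared (16 : 9 : 4 here), not to count.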


The steps on some sample data:

>>> df

       B         C         D
0   this  0.469112 -0.861849
1   this -0.282863 -2.104569
2  other -1.509059 -0.494929
3   view -1.135632  1.071804
4  other  1.212112  0.721555
5  other -0.173215 -0.706771
6   this  0.119209 -1.039575
7   view -1.044236  0.271860
8  other  0.322124  2.010234

>>> col_name = "B"
>>> sample = 4

>>> counts = df[col_name].value_counts()
>>> counts

other    4
this     3
view     2
Name: B, dtype: int64

>>> probs = df[col_name].map(counts)
>>> probs

0    3
1    3
2    4
3    2
4    4
5    4
6    3
7    2
8    4
Name: B, dtype: int64

# seeing side-by-side with df.B
>>> pd.concat([df.B, probs], axis=1)

0   this  3
1   this  3
2  other  4
3   view  2
4  other  4
5  other  4
6   this  3
7   view  2
8  other  4

That is, each value in col_name is tagged with a number that measures its relative weight, inferred from its count in the column.
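As a side note, the same per-row counts can probably be obtained without the intermediate value_counts, using groupby with transform("size"). This is only a sketch of an equivalent route; the names via_map and via_transform are made up for the comparison:

```python
import pandas as pd

# Same B column as in the sample data above.
df = pd.DataFrame(
    {"B": ["this", "this", "other", "view", "other",
           "other", "this", "view", "other"]}
)

# Route used in the answer: map each value to its count.
via_map = df["B"].map(df["B"].value_counts())

# Alternative route: broadcast each group's size back onto its rows.
via_transform = df.groupby("B")["B"].transform("size")

# Both give one weight per row, aligned with df's index.
print((via_map == via_transform).all())
```

Either series can be passed as weights to df.sample.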

# sampling:
>>> sample_df = df.sample(n=sample, weights=probs, random_state=1284)
>>> sample_df

       B         C         D
6   this  0.119209 -1.039575
3   view -1.135632  1.071804
2  other -1.509059 -0.494929
5  other -0.173215 -0.706771
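If what you are after is for each group to contribute rows in proportion to its size (so the 18k is split across the 9 groups according to their counts), a different approach may fit better. The sketch below is an assumption about the intent, not part of the answer above: it uses groupby(...).sample with frac instead of n, on a hypothetical frame with made-up group sizes. Note also that a plain df.sample(n=total) with no weights already picks rows uniformly, so groups show up in proportion to their size on average.

```python
import pandas as pd

# Hypothetical frame: three groups of 40, 30 and 20 rows.
df = pd.DataFrame({"prefix": ["a"] * 40 + ["b"] * 30 + ["c"] * 20})

total = 45                 # desired overall sample size (illustrative)
frac = total / len(df)     # same fraction taken from every group

# Each group contributes round(frac * group_size) rows.
sample_df = df.groupby("prefix").sample(frac=frac, random_state=1)

print(sample_df["prefix"].value_counts())
```

Because each group's row count is rounded separately, the overall total can be off by a few rows when the group sizes do not divide evenly.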