How can I sample from a dataframe weighted by a groupby column?
Here is one way of sampling that I tried:
sample=2000
sample_df = df.groupby('prefix').sample(n=sample, random_state=1)
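It groups df by prefix and, for each group, samples 2k items. I have 9 groups. I would like to sample 18k rows in total, but weighted by the number of rows in each group.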
IIUC, this is one way:
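sample = 2000  # total number of rows to draw; set to 18000 for the 18k in the question
col_name = "prefix"
# per-row weight: the size of the group the row belongs to
probs = df[col_name].map(df[col_name].value_counts())
sample_df = df.sample(n=sample, weights=probs)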
probs holds the (unnormalized) weight for each value in the prefix column, one entry per row, and we sample according to those weights.
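For anyone who wants to run this end to end, here is a minimal sketch on made-up data. The frame below (9 prefixes named p0..p8 with 1000 to 9000 rows, and a value column) is purely an assumption for illustration, not the asker's data, and n is set to the 18k mentioned in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 9 prefixes of unequal size.
rng = np.random.default_rng(0)
sizes = {f"p{i}": 1000 * (i + 1) for i in range(9)}  # p0 -> 1000 rows, ..., p8 -> 9000 rows
df = pd.DataFrame({
    "prefix": np.repeat(list(sizes), list(sizes.values())),
    "value": rng.normal(size=sum(sizes.values())),
})

sample = 18_000  # total number of rows to draw
col_name = "prefix"

# Each row gets the size of its own group as its weight.
probs = df[col_name].map(df[col_name].value_counts())

# df.sample normalizes the weights so they sum to 1, then draws without replacement.
sample_df = df.sample(n=sample, weights=probs, random_state=1)

# How many rows each prefix ended up contributing.
print(sample_df[col_name].value_counts().sort_index())
```

The result is a single draw of 18k rows over the whole frame, so the per-group counts are random rather than fixed in advance.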
Step by step on some sample data:
>>> df
B C D
0 this 0.469112 -0.861849
1 this -0.282863 -2.104569
2 other -1.509059 -0.494929
3 view -1.135632 1.071804
4 other 1.212112 0.721555
5 other -0.173215 -0.706771
6 this 0.119209 -1.039575
7 view -1.044236 0.271860
8 other 0.322124 2.010234
>>> col_name = "B"
>>> sample = 4
>>> counts = df[col_name].value_counts()
>>> counts
other 4
this 3
view 2
Name: B, dtype: int64
>>> probs = df[col_name].map(counts)
>>> probs
0 3
1 3
2 4
3 2
4 4
5 4
6 3
7 2
8 4
Name: B, dtype: int64
# seeing side-by-side with df.B
>>> pd.concat([df.B, probs], axis=1)
0 this 3
1 this 3
2 other 4
3 view 2
4 other 4
5 other 4
6 this 3
7 view 2
8 other 4
That is, each value in col_name gets a number attached to it that, in relative terms, measures its weight as inferred from its count in the column.
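To make "relative weight" concrete, the sketch below rebuilds the sample frame and normalizes the weights the way df.sample does internally (weights that do not sum to 1 are rescaled to sum to 1). The names row_p, group_p and count_share are mine, introduced only for illustration. One thing worth noticing: since every row of a group carries the group's count, a group's total share on a single draw scales with the count squared, not with the count itself.

```python
import pandas as pd

# The sample data from above.
df = pd.DataFrame({
    "B": ["this", "this", "other", "view", "other", "other", "this", "view", "other"],
    "C": [0.469112, -0.282863, -1.509059, -1.135632, 1.212112,
          -0.173215, 0.119209, -1.044236, 0.322124],
    "D": [-0.861849, -2.104569, -0.494929, 1.071804, 0.721555,
          -0.706771, -1.039575, 0.271860, 2.010234],
})
col_name = "B"

counts = df[col_name].value_counts()
probs = df[col_name].map(counts)

# Probability of each row being picked on a single draw.
row_p = probs / probs.sum()

# Total probability per group on a single draw: 16/29, 9/29, 4/29.
group_p = row_p.groupby(df[col_name]).sum()

# For comparison, each group's plain share of the rows: 4/9, 3/9, 2/9.
count_share = counts / counts.sum()

print(group_p)
print(count_share)
```

So with these weights the larger groups are favoured more strongly than their plain share of the rows would suggest.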
# sampling:
>>> sample_df = df.sample(n=sample, weights=probs, random_state=1284)
>>> sample_df
B C D
6 this 0.119209 -1.039575
3 view -1.135632 1.071804
2 other -1.509059 -0.494929
5 other -0.173215 -0.706771
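If what is wanted is instead that each group contributes a share of the 18k equal to its share of the rows, one alternative (a different technique from the weighted df.sample above) is to take the same fraction from every group with groupby(...).sample(frac=...). A rough sketch, assuming pandas 1.1 or newer and the same made-up 9-prefix data as an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 9 prefixes with 1000, 2000, ..., 9000 rows.
sizes = {f"p{i}": 1000 * (i + 1) for i in range(9)}
df = pd.DataFrame({"prefix": np.repeat(list(sizes), list(sizes.values()))})
df["value"] = np.arange(len(df))

total = 18_000
frac = total / len(df)  # same sampling fraction for every group

# Each group contributes about frac * (its size) rows.
sample_df = df.groupby("prefix").sample(frac=frac, random_state=1)

print(len(sample_df))
print(sample_df["prefix"].value_counts().sort_index())
```

Because the per-group sizes are rounded, the total can be off from the target by a few rows when the group sizes do not divide evenly.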