50/50 sampling in Python

Question

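I have a dataset with a binary target variable whose classes are split 4/96. I want to create a subset of the data with a 50/50 split. What is the best way to do this in Python? Thanks!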

Answer 1

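The general answer (not hardwired to the 4/96 split) is to partition the data into two groups (the 0s and the 1s) and then draw as many samples as you need from each partition. This technique is called "stratified random sampling".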

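import random

# `data` is assumed to be a collection of records with a truthy/falsy `target_variable` attribute

# Partition based on the target variable
group0 = [record for record in data if not record.target_variable]
group1 = [record for record in data if record.target_variable]

# Pick as many as needed from each partition
# (the same k from both groups gives a 50/50 split; k cannot exceed the smaller group)
subgroup0 = random.sample(group0, k=4)
subgroup1 = random.sample(group1, k=4)

# Combine and shuffle the results
combined = subgroup0 + subgroup1
random.shuffle(combined)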
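
The steps above can be wrapped into a reusable function. This is only a minimal sketch: the name `balanced_sample`, its parameters, and the toy dictionary records are illustrative assumptions, not part of the original answer. By default it takes as many records per class as the smaller class contains, which is the largest 50/50 subset available without replacement.

```python
import random

def balanced_sample(records, is_positive, per_class=None, seed=None):
    """Return a shuffled list with the same number of records from each class.

    `is_positive` is a hypothetical callback that returns True for class 1.
    """
    rng = random.Random(seed)
    positives = [r for r in records if is_positive(r)]
    negatives = [r for r in records if not is_positive(r)]
    # Default to the size of the smaller class so that sampling
    # without replacement is possible for both groups
    k = min(len(positives), len(negatives)) if per_class is None else per_class
    combined = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(combined)
    return combined

# Toy data with a 4/96 split: 4 positives, 96 negatives
data = [{"id": i, "target": i < 4} for i in range(100)]
subset = balanced_sample(data, lambda r: r["target"], seed=0)
```

With the 4/96 toy data this yields 8 records, 4 from each class. Passing a `per_class` larger than the minority class raises `ValueError`, because `random.sample` draws without replacement.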

Answer 2

You can groupby() your binary variable and then sample from each group.

Generate some random data:

>>> import random
>>> import pandas as pd
>>> df = pd.DataFrame([{'variable': ''.join(random.sample('abcdefghijklmnopqrstuvwxyz', 4)), 'outcome': (random.random() > .94)} for i in range(100)])

>>> print(df)
    outcome variable
0     False     irlk
1     False     ylmp
2      True     przk
3     False     xldf
4     False     oxsp
5     False     uytn
6     False     ifmw
7      True     lepa
8     False     zfvm
...
98    False     qjek
99    False     umtw

Sample as needed:

>>> num_samples = 3
>>> df.groupby('outcome').apply(lambda x: x.sample(num_samples))
            outcome variable
outcome                     
False   71    False     jdrp
        98    False     eqrj
        78    False     tnzl
True    29     True     uvjr
        36     True     tiwn
        63     True     tabr
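
In pandas 1.1 and later, the same per-group draw can be written without `apply`, using `DataFrameGroupBy.sample`. The sketch below is an illustration under assumptions: the toy frame, the `value` column, and the variable names are made up for the example. It sizes each group to the minority class so that the result is exactly 50/50.

```python
import pandas as pd

# Toy frame with a 4/96 split in the binary column (illustrative data)
df = pd.DataFrame({"outcome": [True] * 4 + [False] * 96, "value": range(100)})

# Size of the minority class, so every group can supply that many rows
n_per_class = int(df["outcome"].value_counts().min())

# DataFrameGroupBy.sample draws n rows from each group (pandas >= 1.1)
balanced = df.groupby("outcome").sample(n=n_per_class, random_state=0)

# Shuffle the rows so the two classes are interleaved
balanced = balanced.sample(frac=1, random_state=0).reset_index(drop=True)
```

Fixing `random_state` makes the draw reproducible. Note that sampling fails with a `ValueError` if any group has fewer rows than requested, which can also happen with `num_samples = 3` on the randomly generated data above.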
