我如何从头开始编写代码以按目标变量进行分层抽样？

Question

全部，我正在尝试从头开始创建（不使用 sklearn 库）以创建 5 个样本（df 的 len / 5），这样每个样本都具有与原始数据集相同比例的目标变量（1） .例如原来有 5% 的癌症患者，我希望我的 5 个样本中的每一个也有 5% 的目标变量。不确定如何操作，

df_list=[]
n= round(len(df)/5)
for m in range(1,6):
    m = m*n
    print(df[:m])
    df_list.append(df[:m])

这创建了我想要的每个块，但我现在如何才能使目标变量与原始变量的百分比相同？

Answer 1

解决方案：

import numpy as np
import math

def stratify(data, target='y', n=10):
    array = data.values
    y = data[target].values
    
    unique, counts = np.unique(data[target].values, return_counts=True)
    new_counts = counts * (n/sum(counts))
    new_counts = fit_new_counts_to_n(new_counts, n)
    
    selected_count = np.zeros(len(unique))
    selected_row_indices = []
    for i in range(array.shape[0]):
        if sum(selected_count) == sum(new_counts):
            break
        cr_target_value = y[i]
        cr_target_index = np.where(unique==cr_target_value)[0][0]
        if selected_count[cr_target_index] < new_counts[cr_target_index]:
            selected_row_indices.append(i)
            selected_count[cr_target_index] += 1
    row_indices_mask = np.array([x in selected_row_indices for x in np.arange(array.shape[0])])
    
    return pd.DataFrame(array[row_indices_mask], columns=data.columns)

效用 class:

def fit_new_counts_to_n(new_counts, n):
    decimals = [math.modf(x)[0] for x in new_counts]
    integers = [int(math.modf(x)[1]) for x in new_counts]
    arg_max = np.array(map(np.argmax, decimals))
    sorting_indices =  np.argsort(decimals)[::-1][:n]
    for i in sorting_indices:
        if sum(integers) < n:
            integers[i] += 1
        else:
            break
    return integers

用法示例：

data = [[  3,   0],
        [ 54,   3],
        [  3,   1],
        [ 64,   1],
        [ 65,   0],
        [ 34,   1],
        [ 45,   2],
        [534,   2],
        [ 57,   1],
        [ 64,   3],
        [  5,   1],
        [ 45,   1],
        [546,   1],
        [  4,   2],
        [ 53,   3],
        [345,   2],
        [456,   2],
        [435,   3],
        [545,   1],
        [ 45,   3]]

data = pd.DataFrame(data, columns=['X1', 'y'])

stratified_data = stratify(data, target='y', n=10)

结果：

      [[  3,   0],
       [ 54,   3],
       [  3,   1],
       [ 64,   1],
       [ 34,   1],
       [ 45,   2],
       [534,   2],
       [ 57,   1],
       [ 64,   3],
       [ 53,   3]]

我如何从头开始编写代码以按目标变量进行分层抽样？

how can i write from scratch code to do stratified sampling by target variable?

python

sampling

pandas

imbalanced-data