Randomly sample model scores into 4 groups with a similar distribution in Python

Question

I have a dataset of model scores ranging from 0 to 1. The table looks like this:

| Score |
| ----- |
| 0.55  |
| 0.67  |
| 0.21  |
| 0.05  |
| 0.91  |
| 0.15  |
| 0.33  |
| 0.47  |

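I want to randomly split these scores into 4 groups: control, treatment 1, treatment 2, and treatment 3. The control group should contain 20% of the observations, and the remaining 80% must be split across the other 3 groups in equal sizes. However, I want the distribution of scores to be the same in every group. How can I do this in Python?

PS: This is only a representation of the actual table; the real one has many more observations.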

Answer 1

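You can use numpy.random.choice to assign each row to a random group with defined probabilities, then use groupby to split the DataFrame. Each row is drawn independently, so the group sizes match the 20% / 26.7% / 26.7% / 26.7% split only on average:

import numpy as np
import pandas as pd

# df is the DataFrame from the question, with a single 'Score' column
group = np.random.choice(['control', 'treatment 1', 'treatment 2', 'treatment 3'],
                         size=len(df),
                         p=[.2, .8/3, .8/3, .8/3])

# one DataFrame per group, keyed by group name
dict(list(df.groupby(pd.Series(group, index=df.index))))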

Possible output (each value in the dictionary is a DataFrame):

{'control':    Score
 2   0.21
 5   0.15,
 'treatment 1':    Score
 7   0.47,
 'treatment 2':    Score
 1   0.67
 3   0.05,
 'treatment 3':    Score
 0   0.55
 4   0.91
 6   0.33}
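
If the group sizes need to be exact rather than correct on average, one option is to build the label array with the desired counts and shuffle it. This is only a sketch of that idea, not part of the answer above: the stand-in DataFrame, the seed, and the way the remainder is spread across the treatment groups are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({"Score": rng.random(100)})  # hypothetical stand-in data

names = ["control", "treatment 1", "treatment 2", "treatment 3"]
n = len(df)
n_control = round(0.2 * n)
rest = n - n_control
# Spread the remaining rows as evenly as possible over the three treatments.
sizes = [n_control] + [rest // 3 + (i < rest % 3) for i in range(3)]

# Repeat each name by its size, then shuffle the labels across the rows.
labels = np.repeat(names, sizes)
df["group"] = rng.permutation(labels)

print(df["group"].value_counts())
```

With 100 rows this yields 20 control rows and 27, 27, and 26 treatment rows, since 80 is not divisible by 3.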

Answer 2

Generate the numbers:

import random

# 10 random scores drawn uniformly from [0, 1]
randomlist = [random.uniform(0, 1) for _ in range(10)]

randomlist

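Split them into chunks of roughly equal size, in this case 4:

import math

categories = 4
# round up so that at most `categories` chunks are produced
length = math.ceil(len(randomlist) / categories)

chunks = [randomlist[x:x+length] for x in range(0, len(randomlist), length)]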
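
The chunking above produces chunks of about the same size. To get the 20% control group the question asks for, the same slicing idea can be applied to a shuffled list with unequal chunk sizes. The following is a rough sketch under that assumption; the seed and the variable names are illustrative only.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
scores = [random.uniform(0, 1) for _ in range(10)]

# Shuffle a copy so that each chunk is a random subset of the scores.
shuffled = scores[:]
random.shuffle(shuffled)

n = len(shuffled)
n_control = round(0.2 * n)
rest = n - n_control
# Control first, then three treatment chunks as equal as possible.
sizes = [n_control] + [rest // 3 + (i < rest % 3) for i in range(3)]

chunks = []
start = 0
for size in sizes:
    chunks.append(shuffled[start:start + size])
    start += size

control, treatment1, treatment2, treatment3 = chunks
print([len(c) for c in chunks])
```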

Answer 3

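I use a list just for illustration. For each number, you roll a five-sided die; if it shows 1, the score goes to the control group. If it is not 1, you roll a three-sided die (yes, there is probably no such thing ;)), which decides the treatment group.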

import random
# named `scores` rather than `list` to avoid shadowing the built-in
scores = [0.23, 0.034, 0.35, 0.75, 0.92, 0.25, 0.9]
control = []
treatment1 = []
treatment2 = []
treatment3 = []
for score in scores:
    dice = random.randint(1,5)
    print(dice, 'is dice')
    if dice == 1:
        control.append(score)
    else:
        seconddice = random.randint(1,3)
        print(seconddice, 'is second dice')
        if seconddice == 1:
            treatment1.append(score)
        elif seconddice == 2:
            treatment2.append(score)
        else: # seconddice == 3:
            treatment3.append(score)
    
print(control, 'is control')
print(treatment1, 'is treatment1')
print('and so on')

I ran it on a short test list, and the result was

5 is dice
1 is second dice
1 is dice
4 is dice
3 is second dice
5 is dice
2 is second dice
1 is dice
5 is dice
1 is second dice
3 is dice
1 is second dice
[0.034, 0.92] is control
[0.23, 0.25, 0.9] is treatment1
and so on
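
The two die rolls amount to a single weighted draw: control has probability 1/5 = 3/15, and each treatment has probability 4/5 × 1/3 = 4/15. As a compact sketch of the same idea (not the code above), random.choices can draw all assignments at once; the seed and the dictionary layout are assumptions made for illustration.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
scores = [0.23, 0.034, 0.35, 0.75, 0.92, 0.25, 0.9]
names = ["control", "treatment1", "treatment2", "treatment3"]

# Weights 3:4:4:4 correspond to probabilities 3/15 (= 1/5) and 4/15 each.
assigned = random.choices(names, weights=[3, 4, 4, 4], k=len(scores))

# Collect the scores per group.
groups = {name: [] for name in names}
for score, name in zip(scores, assigned):
    groups[name].append(score)

print(groups)
```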

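The larger the dataset, the closer the group sizes and the score distributions within each group come to the target.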
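
For smaller datasets, pure random assignment can leave the groups with noticeably different score distributions. One way to tighten this is blocked (stratified) randomization: sort the rows by score and, within each run of 15 consecutive rows, hand out 3 control labels and 4 of each treatment label in random order. This is a different technique from the ones in the answers, sketched here under assumptions: the stand-in data, the seed, and the block size of 15 are all illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({"Score": rng.random(300)})  # hypothetical stand-in data

# One block of 15 labels has exactly the 3:4:4:4 mix (20% control).
pattern = np.array(["control"] * 3 + ["treatment 1"] * 4
                   + ["treatment 2"] * 4 + ["treatment 3"] * 4)

# Row positions ordered from lowest to highest score.
order = np.argsort(df["Score"].to_numpy())
labels = np.empty(len(df), dtype=object)
for start in range(0, len(df), len(pattern)):
    block = order[start:start + len(pattern)]
    # Shuffle the pattern; truncate it if the last block is incomplete.
    labels[block] = rng.permutation(pattern)[:len(block)]

df["group"] = labels

# Compare the per-group summary statistics of the scores.
print(df.groupby("group")["Score"].describe())
```

Every group then draws from every part of the score range, so the per-group summaries printed at the end should look alike. When the row count is not a multiple of 15, the last block is truncated and the group sizes are only approximately on target.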
