How to randomly add elements to a DataFrame column (equally distributed within groups)

Question

Suppose I have the following DataFrame:

Type    Name
S2019   John
S2019   Stephane
S2019   Mike
S2019   Hamid
S2021   Rahim
S2021   Ahamed

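I want to group the dataset by "Type", then add a new column named "Sampled" and randomly assign yes/no to each row, with yes and no equally distributed. The expected DataFrame could be: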

Type    Name    Sampled
S2019   John    no
S2019   Stephane    yes
S2019   Mike    yes
S2019   Hamid   no
S2021   Rahim   yes
S2021   Ahamed  no

Answer 1

You can use numpy.random.choice:

import numpy as np
df['Sampled'] = np.random.choice(['yes', 'no'], size=len(df))

Output:

    Type      Name Sampled
0  S2019      John      no
1  S2019  Stephane      no
2  S2019      Mike     yes
3  S2019     Hamid      no
4  S2021     Rahim      no
5  S2021    Ahamed     yes

Equal probability per group:

df['Sampled'] = (df.groupby('Type')['Type']
                   .transform(lambda g: np.random.choice(['yes', 'no'],
                                                         size=len(g)))
                )

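For each group, take an arbitrary column (Type here, but it does not matter; it is only there to get a 1-dimensional shape) and apply np.random.choice with the group length as the size. This returns as many yes/no values as there are rows in the group, each drawn with equal probability (note that you can set a specific probability for each item through the p parameter if needed).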
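
As a minimal sketch of the per-item probability mentioned above, the p parameter of np.random.choice can carry one weight per element. The weights 0.7/0.3 below are hypothetical values chosen only for illustration, and the DataFrame is rebuilt from the question's data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

# Hypothetical weights (an assumption): 'yes' with probability 0.7, 'no' with 0.3.
weights = [0.7, 0.3]

# Draw one label per row, group by group, using the weights above.
df["Sampled"] = df.groupby("Type")["Type"].transform(
    lambda g: np.random.choice(["yes", "no"], size=len(g), p=weights)
)
print(df)
```

The weights must sum to 1 and be listed in the same order as the labels.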

NB. Equal probability does not mean you will necessarily get a 50/50 split of yes/no. If that is what you want, please clarify.
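
To make this concrete: for a group of 4 rows, an exact 2/2 split happens with probability C(4,2)/2^4 = 6/16 = 0.375, so most draws are unbalanced. The simulation below is only an illustrative sketch; it uses np.random.default_rng with a fixed seed (my choice, not part of the answer) so the estimate is reproducible.

```python
import numpy as np

# Seeded generator, used here only to make the experiment repeatable.
rng = np.random.default_rng(0)

# Simulate 10,000 groups of 4 rows, each row drawn independently with equal probability.
draws = rng.choice(["yes", "no"], size=(10_000, 4))

# Count the 'yes' labels in each simulated group.
yes_counts = (draws == "yes").sum(axis=1)

# Fraction of groups that ended up with exactly 2 'yes' and 2 'no'.
share_balanced = (yes_counts == 2).mean()
print(share_balanced)  # should land near 6/16 = 0.375
```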

Half yes/half no per group

If you want half of each kind (yes/no), ±1 for odd sizes, you can select half of the indices at random.

idx = df.groupby('Type', group_keys=False).apply(lambda g: g.sample(n=len(g)//2)).index

df['Sampled'] = np.where(df.index.isin(idx), 'yes', 'no')

NB. For an odd group size, the second value passed to np.where gets the extra item, here "no".
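
The odd-size behaviour can be checked with a small sketch. The data below is hypothetical (group sizes 5 and 3, placeholder names), and the sampling is done on the Name column only, a slight variation of the snippet above that selects the same kind of index.

```python
import numpy as np
import pandas as pd

# Hypothetical data with odd group sizes: 5 rows and 3 rows.
df = pd.DataFrame({
    "Type": ["S2019"] * 5 + ["S2021"] * 3,
    "Name": list("ABCDEFGH"),
})

# Pick floor(n/2) random row labels in each group.
idx = (
    df.groupby("Type", group_keys=False)["Name"]
      .apply(lambda s: s.sample(n=len(s) // 2))
      .index
)

# Sampled rows get 'yes'; the remaining rows (including the odd one out) get 'no'.
df["Sampled"] = np.where(df.index.isin(idx), "yes", "no")

# Cross-tabulate to see the split per group.
counts = pd.crosstab(df["Type"], df["Sampled"])
print(counts)
```

Swapping the two labels in np.where would give the extra row to "yes" instead.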

Distributing multiple elements equally:

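This distributes the elements equally, up to the limit of divisibility. That means that for 3 elements and 4 positions you get two a, one b, and one c, in random order. If you want the extra item to be chosen at random, shuffle the input first.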

elem = ['a', 'b', 'c']
df['Sampled'] = (df
   .groupby('Type', group_keys=False)['Type']
   # tile elem to cover the group, trim to the group size, then draw without replacement (a permutation)
   .transform(lambda g: np.random.choice(np.tile(elem, int(np.ceil(len(g)/len(elem))))[:len(g)],
                                         size=len(g), replace=False))
)

Output:

    Type      Name Sampled
0  S2019      John       a
1  S2019  Stephane       a
2  S2019      Mike       b
3  S2019     Hamid       c
4  S2021     Rahim       a
5  S2021    Ahamed       b
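
In the output above the extra slot always goes to the first elements of elem (a, then b). A possible sketch of the "shuffle the input first" suggestion follows; the helper name spread is hypothetical, and np.random.permutation is used both to shuffle the labels and to randomize their positions.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Type": ["S2019"] * 4 + ["S2021"] * 2,
    "Name": ["John", "Stephane", "Mike", "Hamid", "Rahim", "Ahamed"],
})

elem = ["a", "b", "c"]

def spread(g):
    """Hypothetical helper: near-equal label counts, with random leftovers."""
    # Shuffle a copy of the labels so a random label receives the extra slot.
    shuffled = np.random.permutation(elem)
    # Repeat the labels enough times to cover the group, then trim to its size.
    reps = int(np.ceil(len(g) / len(elem)))
    pool = np.tile(shuffled, reps)[:len(g)]
    # Randomize which row gets which label.
    return np.random.permutation(pool)

df["Sampled"] = df.groupby("Type")["Type"].transform(spread)
print(df)
```

With 4 rows and 3 labels, one randomly chosen label appears twice; with 2 rows, two randomly chosen labels appear once each.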

Answer 2

Use a custom function in GroupBy.transform that creates a helper array arr with equally distributed yes/no values, then randomizes the order with numpy.random.shuffle:

def f(x):
    arr = np.full(len(x), ['no'], dtype=object)
    arr[:int(len(x) * 0.5)] = 'yes'
    np.random.shuffle(arr)
    return arr

df['Sampled'] = df.groupby('Type')['Name'].transform(f)
print(df)
    Type      Name Sampled
0  S2019      John     yes
1  S2019  Stephane      no
2  S2019      Mike      no
3  S2019     Hamid     yes
4  S2021     Rahim      no
5  S2021    Ahamed     yes
