Randomly split a DataFrame into groups with an even distribution of values

Question

I have a DataFrame with two groups (A and B), and within each group there are 6 subgroups (a, b, c, d, e and f). Example data below:

index   group    subgroup    value
0       A        a           1
1       A        b           1
2       A        c           1
3       A        d           1
4       A        e           1
5       A        f           1
6       B        a           1
7       B        b           1
8       B        c           1
9       B        d           1
10      B        e           1
11      B        f           1
...     ...      ...         ...

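Although I have listed only 12 rows here, all with a value of 1, the real dataset has 300 rows (with values of 2, 3, and so on). I am trying to split the DataFrame randomly into 6 batches of 50 rows each. However, I want each batch to contain an even distribution of group values (so 25 A's and 25 B's) and an approximately even distribution of subgroup values.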

For example, batch_1 might contain:

25 A's, made up of 4 a's, 5 b's, 4 c's, 4 d's, 5 e's and 3 f's, and 25 B's, made up of 5 a's, 4 b's, 3 c's, 5 d's, 4 e's and 4 f's.

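These 6 batches will be given to one user. (So I actually need to split the DataFrame randomly into several different sets of 6 batches, to serve more users.) But I don't know whether I should approach this by randomly splitting the DataFrame or by sampling from it. Does anyone have a suggestion on how to achieve this?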

This might help, but it does not ensure an even distribution of values: https://www.geeksforgeeks.org/break-list-chunks-size-n-python/
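For reference, here is a minimal sketch of that plain "shuffle, then cut into chunks" idea. The frame below is a hypothetical stand-in for the real data (2 groups x 6 subgroups x 25 rows), and the names `shuffled` and `batches` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 2 groups x 6 subgroups x 25 rows = 300 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    [(g, s) for g in "AB" for s in "abcdef" for _ in range(25)],
    columns=["group", "subgroup"],
)
df["value"] = rng.integers(1, 4, size=len(df))  # random values in 1..3

# Shuffle the rows, then cut the shuffled frame into 6 consecutive chunks of 50.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
batches = [shuffled.iloc[i:i + 50] for i in range(0, len(shuffled), 50)]

# Every batch has 50 rows, but the A/B split inside each batch is left to chance.
for i, batch in enumerate(batches, start=1):
    print(i, batch["group"].value_counts().to_dict())
```

The batch sizes come out right, but nothing in this approach constrains how many A's, B's or subgroup members land in each batch.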

Answer 1

Using a few tricks:

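Use pd.factorize() to convert the categorical columns into an integer code per category
Compute a value/factor f that represents each group/subgroup pair
Randomise it slightly with np.random.uniform(), with the min and max close to 1
Once you have a value that represents the grouping, sort_values() and reset_index() give you a clean, ordered index
Finally, compute the batch as the integer remainder of that index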
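To make the first two steps concrete, here is a toy sketch of what the key f looks like before any randomisation. The four-row frame `toy` is hypothetical and only serves as an illustration.

```python
import pandas as pd

# Tiny hypothetical frame, only to show what the key looks like.
toy = pd.DataFrame({"group": list("AABB"), "subgroup": list("abab")})

# factorize() numbers categories in order of first appearance, starting at 0.
g = pd.factorize(toy["group"])[0] + 1      # A -> 1, B -> 2
s = pd.factorize(toy["subgroup"])[0] + 1   # a -> 1, b -> 2

# Tens digit encodes the group, ones digit encodes the subgroup.
toy["f"] = g * 10 + s
print(toy["f"].tolist())  # [11, 12, 21, 22]
```

Sorting by such a key places all rows of one group/subgroup pair next to each other, so taking the row index modulo the number of batches deals each pair out across the batches in turn.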

import random
import numpy as np
import pandas as pd

# synthetic test data: 300 rows with a random group, subgroup and value
group = list("ABCD")
subgroup = list("abcdef")
df = pd.DataFrame([{"group": random.choice(group),
                    "subgroup": random.choice(subgroup),
                    "value": random.randint(1, 3)} for _ in range(300)])

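bins = 6
dfc = df.assign(
    # sort key: the tens digit encodes the group, the ones digit the subgroup;
    # the subgroup term is jittered by up to 1%, which randomises the row
    # order within each group/subgroup pair without mixing the pairs
    f=((pd.factorize(df["group"])[0] + 1) * 10 +
       (pd.factorize(df["subgroup"])[0] + 1)
       * np.random.uniform(0.99, 1.01, len(df))
       ),
).sort_values("f").reset_index(drop=True).assign(
    # deal the sorted rows round-robin into batches 0 .. bins-1
    gc=lambda dfa: dfa.index % bins
).drop(columns="f")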

# check distribution ... used plot for SO
dfc.groupby(["gc","group","subgroup"]).count().unstack(0).plot(kind="barh")
# every group same size...
# dfc.groupby("gc").count()

# now it's easy to get each of the cuts.... 0 through 5
# dfcut0 = dfc.query("gc==0").drop(columns="gc").copy().reset_index(drop=True)
# dfcut0
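
If the exact 25/25 split from the question is a hard requirement, one possible variant (a sketch, not part of the answer above) is to number the rows separately inside each group and take the remainder of that per-group position. The function `stratified_batches`, its parameters and the balanced synthetic frame are all hypothetical, and the exact counts rely on every group size being a multiple of the number of batches.

```python
import numpy as np
import pandas as pd

def stratified_batches(df, n_batches=6, seed=None):
    """Hypothetical helper: add a 'batch' column balanced by group and subgroup."""
    rng = np.random.default_rng(seed)
    # Shuffle first so the order inside each (group, subgroup) cell is random.
    out = df.iloc[rng.permutation(len(df))].reset_index(drop=True)
    # Stable sort by group, then subgroup (np.lexsort treats the LAST key as primary).
    g_codes = pd.factorize(out["group"])[0]
    s_codes = pd.factorize(out["subgroup"])[0]
    out = out.iloc[np.lexsort((s_codes, g_codes))].reset_index(drop=True)
    # Position of each row inside its own group: 0, 1, 2, ...
    pos = out.groupby("group").cumcount()
    # Random starting batch per group, so the "extra" rows do not always hit batch 0.
    offsets = {g: int(rng.integers(n_batches)) for g in out["group"].unique()}
    out["batch"] = (pos + out["group"].map(offsets)) % n_batches
    return out

# Assumed balanced data: 2 groups x 6 subgroups x 25 rows = 300 rows.
df = pd.DataFrame(
    [(g, s) for g in "AB" for s in "abcdef" for _ in range(25)],
    columns=["group", "subgroup"],
)
df["value"] = np.random.default_rng(0).integers(1, 4, size=len(df))

res = stratified_batches(df, seed=1)
# Rows per batch and group, then the subgroup breakdown of the first batch.
print(res.groupby(["batch", "group"]).size().unstack())
print(pd.crosstab(res.loc[res["batch"] == 0, "group"],
                  res.loc[res["batch"] == 0, "subgroup"]))
```

Calling the function again with a different seed yields a different set of 6 batches, which is one way to produce separate splits for additional users.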

