Randomly chunk variables to groups of a certain number

Question

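I have a large pandas dataframe in which I am trying to randomly chunk objects into groups of a certain size. For example, I am trying to chunk the objects below into groups of 3. However, each group must come from the same type. Here is a toy dataset: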

type     object       index

ball     soccer       1
ball     soccer       2
ball     basket       1
ball     bouncy       1
ball     tennis       1
ball     tennis       2
chair    office       1
chair    office       2
chair    office       3
chair    lounge       1
chair    dining       1
chair    dining       2
...      ...          ...
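
For anyone who wants to reproduce this, the toy table above could be built roughly as follows. The values are copied from the table and the `...` rows are left out. The count of distinct objects per type is printed only to show what is supposed to be chunked.

```python
import pandas as pd

# Toy data copied from the table above (the trailing "..." rows are omitted).
df = pd.DataFrame({
    "type": ["ball"] * 6 + ["chair"] * 6,
    "object": ["soccer", "soccer", "basket", "bouncy", "tennis", "tennis",
               "office", "office", "office", "lounge", "dining", "dining"],
    "index": [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1, 2],
})

# Number of distinct objects per type: these are what get chunked into groups of 3.
unique_per_type = df.groupby("type")["object"].nunique()
print(unique_per_type)
```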

Desired output:

type     object       index    group

ball     soccer       1        ball_1
ball     soccer       2        ball_1
ball     basket       1        ball_1
ball     bouncy       1        ball_1
ball     tennis       1        ball_2
ball     tennis       2        ball_2
chair    office       1        chair_1
chair    office       2        chair_1
chair    office       3        chair_1
chair    lounge       1        chair_1
chair    dining       1        chair_1
chair    dining       2        chair_1
...      ...          ...      ...

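So group ball_1 contains 3 unique objects of the same type: soccer, basket, and bouncy. The remaining object goes into group ball_2, which has only 1 object. Since the dataframe is so large, I am hoping for a long list of groups that contain 3 objects each, plus one group per type that holds the remainder (fewer than 3 objects).

Again, while my example only contains a few objects, I want the objects to be assigned to groups of 3 at random. (My real dataset will contain many more balls and chairs.)

This seems helpful, but I haven't figured out how to apply it yet: How do you split a list into evenly sized chunks?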

Answer 1

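If you need to split the objects into chunks of N within each type, use factorize with GroupBy.transform to number the distinct objects per type, integer-divide by N and add 1, and finally prepend the type column with Series.str.cat: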

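import pandas as pd

N = 3  # number of distinct objects per group
# Number each distinct object within its type (0, 1, 2, ...), then bucket every N codes together
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1

# Prefix the bucket number with the type, e.g. ball_1
df['group'] = df['type'].str.cat(g.astype(str), sep='_')
print(df)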
     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6   chair  office      1  chair_1
7   chair  office      2  chair_1
8   chair  office      3  chair_1
9   chair  lounge      1  chair_1
10  chair  dining      1  chair_1
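
To see why this produces the grouping above, here is a small sketch of the factorize step on the ball objects alone. The variable names are only for illustration.

```python
import pandas as pd

N = 3
objects = pd.Series(["soccer", "soccer", "basket", "bouncy", "tennis", "tennis"])

# factorize gives each distinct value an integer code in order of first appearance:
# soccer -> 0, basket -> 1, bouncy -> 2, tennis -> 3
codes, uniques = pd.factorize(objects)

# Integer division by N puts N consecutive codes into one bucket; + 1 makes labels 1-based.
group_numbers = codes // N + 1

print(list(codes))          # [0, 0, 1, 2, 3, 3]
print(list(group_numbers))  # [1, 1, 1, 1, 2, 2]
```

Repeated rows of the same object share a code, so they always end up in the same group.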

If you also need the assignment to be random, shuffle the rows first with DataFrame.sample:

N = 3
# Shuffle the rows so that factorize sees the objects in random order
df = df.sample(frac=1)
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1

df['group'] = df['type'].str.cat(g.astype(str), sep='_')
print(df)
     type  object  index    group
10  chair  dining      1  chair_1
8   chair  office      3  chair_1
2    ball  basket      1   ball_1
1    ball  soccer      2   ball_1
7   chair  office      2  chair_1
0    ball  soccer      1   ball_1
9   chair  lounge      1  chair_1
4    ball  tennis      1   ball_1
6   chair  office      1  chair_1
3    ball  bouncy      1   ball_2
5    ball  tennis      2   ball_1
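
If the original row order should be kept, or the result needs to be reproducible, one possible variation is to shuffle only the distinct objects within each type rather than the rows. This is a sketch, not part of the answer above: `assign_random_groups` and its `seed` parameter are hypothetical names, and it uses NumPy's `default_rng` in place of `DataFrame.sample`.

```python
import numpy as np
import pandas as pd


def assign_random_groups(df, n=3, seed=None):
    """Hypothetical helper: add a 'group' column of the form '<type>_<k>' where
    each label covers at most n distinct objects, picked at random per type."""
    rng = np.random.default_rng(seed)

    def bucket_numbers(objs):
        # Shuffle the positions of the distinct objects, then chunk them n at a time.
        uniques = objs.unique().tolist()
        order = rng.permutation(len(uniques))
        bucket = {uniques[i]: pos // n + 1 for pos, i in enumerate(order)}
        return objs.map(bucket)

    g = df.groupby("type")["object"].transform(bucket_numbers)
    return df.assign(group=df["type"].str.cat(g.astype(str), sep="_"))


# Toy data from the question (without the "..." rows).
df = pd.DataFrame({
    "type": ["ball"] * 6 + ["chair"] * 6,
    "object": ["soccer", "soccer", "basket", "bouncy", "tennis", "tennis",
               "office", "office", "office", "lounge", "dining", "dining"],
    "index": [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1, 2],
})

out = assign_random_groups(df, n=3, seed=0)
print(out)
```

Passing the same `seed` gives the same assignment on every run. With `seed=None` the grouping changes from run to run.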
