Randomly chunk variables into groups of a certain size
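I have a large pandas DataFrame in which I am trying to randomly chunk objects into groups of a certain size. For example, I am trying to chunk the values in the object column below into groups of 3. However, each group must come from the same type. Here is a toy dataset: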
type object index
ball soccer 1
ball soccer 2
ball basket 1
ball bouncy 1
ball tennis 1
ball tennis 2
chair office 1
chair office 2
chair office 3
chair lounge 1
chair dining 1
chair dining 2
... ... ...
Desired output:
type object index group
ball soccer 1 ball_1
ball soccer 2 ball_1
ball basket 1 ball_1
ball bouncy 1 ball_1
ball tennis 1 ball_2
ball tennis 2 ball_2
chair office 1 chair_1
chair office 2 chair_1
chair office 3 chair_1
chair lounge 1 chair_1
chair dining 1 chair_1
chair dining 2 chair_1
... ... ... ...
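So the group ball_1 contains 3 unique objects of the same type: soccer, basket, and bouncy. The remaining object goes into the group ball_2, which has only 1 object. Because the DataFrame is so large, I am hoping for a long list of groups containing 3 objects each, plus one group containing the leftover objects (fewer than 3).
Again, although my example contains only a few objects, I want the objects to be randomly sorted into groups of 3. (My real dataset will contain many more balls and chairs.)
This seemed helpful, but I have not figured out how to apply it yet: How do you split a list into evenly sized chunks?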
If you need to split each type into chunks of N unique object values, you can use factorize with GroupBy.transform, integer-divide the result by N and add 1, and finally prepend the type column with Series.str.cat:
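import pandas as pd

N = 3
# Number the unique objects within each type (0, 1, 2, ...), then cut them into chunks of N
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
# Build labels such as ball_1 by joining the type and the chunk number
df['group'] = df['type'].str.cat(g.astype(str), sep='_')
print(df)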
type object index group
0 ball soccer 1 ball_1
1 ball soccer 2 ball_1
2 ball basket 1 ball_1
3 ball bouncy 1 ball_1
4 ball tennis 1 ball_2
5 ball tennis 2 ball_2
6 chair office 1 chair_1
7 chair office 2 chair_1
8 chair office 3 chair_1
9 chair lounge 1 chair_1
10 chair dining 1 chair_1
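To make the three steps easier to follow, here is a minimal self-contained sketch that rebuilds the toy data from the question and keeps each intermediate result in its own variable. The names codes and chunk are only illustrative and do not appear in the answer above.

```python
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    "type": ["ball"] * 6 + ["chair"] * 6,
    "object": ["soccer", "soccer", "basket", "bouncy", "tennis", "tennis",
               "office", "office", "office", "lounge", "dining", "dining"],
    "index": [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1, 2],
})

N = 3  # number of unique objects per group

# Step 1: within each type, number the unique objects 0, 1, 2, ...
# in order of first appearance (repeated objects share a number).
codes = df.groupby("type")["object"].transform(lambda s: pd.factorize(s)[0])

# Step 2: integer division maps codes 0..N-1 to 0, N..2N-1 to 1, and so on;
# adding 1 makes the chunk numbers start at 1.
chunk = codes // N + 1

# Step 3: join the type name and the chunk number with an underscore.
df["group"] = df["type"].str.cat(chunk.astype(str), sep="_")

print(df)
```

Because factorize gives every repeated object the same code, all rows of one object always land in the same group, which is what the desired output in the question asks for.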
If you also need the assignment to be random, shuffle the rows first with DataFrame.sample:
import pandas as pd

N = 3
# Shuffle all rows so the order in which objects are first seen is random
df = df.sample(frac=1)
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), sep='_')
print(df)
type object index group
10 chair dining 1 chair_1
8 chair office 3 chair_1
2 ball basket 1 ball_1
1 ball soccer 2 ball_1
7 chair office 2 chair_1
0 ball soccer 1 ball_1
9 chair lounge 1 chair_1
4 ball tennis 1 ball_1
6 chair office 1 chair_1
3 ball bouncy 1 ball_2
5 ball tennis 2 ball_1
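Note that df.sample(frac=1) also reorders the rows of the DataFrame. If the original row order matters, one possible variation is to shuffle only the unique (type, object) pairs and then look the labels up for every row. The sketch below is an assumption-laden alternative, not part of the answer above: the function name assign_random_groups and its n and seed parameters are hypothetical.

```python
import pandas as pd

def assign_random_groups(df, n=3, seed=None):
    """Return a copy of df with a 'group' column; row order and index are kept.

    Hypothetical helper: shuffles the unique (type, object) pairs, then
    chunks them into groups of n within each type.
    """
    # One row per unique (type, object) pair, in random order.
    pairs = (df[["type", "object"]]
             .drop_duplicates()
             .sample(frac=1, random_state=seed))
    # Position of each pair within its type after shuffling, cut into chunks of n.
    chunk = pairs.groupby("type").cumcount() // n + 1
    pairs = pairs.assign(group=pairs["type"].str.cat(chunk.astype(str), sep="_"))
    # Look up the label for every original row; join keeps df's index and order.
    lookup = pairs.set_index(["type", "object"])["group"]
    return df.join(lookup, on=["type", "object"])

# Toy data copied from the question.
df = pd.DataFrame({
    "type": ["ball"] * 6 + ["chair"] * 6,
    "object": ["soccer", "soccer", "basket", "bouncy", "tennis", "tennis",
               "office", "office", "office", "lounge", "dining", "dining"],
    "index": [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1, 2],
})

out = assign_random_groups(df, n=3, seed=0)
print(out)
```

Passing a fixed seed makes the random assignment reproducible; leaving it as None gives a different grouping on each run.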