Add a certain number of variables from one group to another
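I have a pandas DataFrame in which the object values of the same type are split into groups of a fixed size (for example, 3). For example, group ball_1 contains 3 unique objects of the same type: soccer, basket, and bouncy. The remaining objects go into group ball_2, which in this case contains only 1 object, tennis.
For groups with fewer than 3 unique objects, I would like to pad them with the first k unique objects from the first group. For example, group ball_2 would contain tennis, followed by soccer and basket from group ball_1. The goal is for every group to have the same number of unique objects.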
import pandas as pd

# chunk into groups of 3
N = 3
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), sep='_')

# identify which groups need more objects
for name, batch in df.groupby('group'):
    # number of unique objects this group is short of N
    need = N - batch['object'].nunique()
    if need <= 0:
        continue
    print('{} needs {} more objects'.format(name, need))
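The same check can also be written without an explicit loop. This is only a minimal sketch: it rebuilds the toy data from this question so that it runs on its own, and the name need is introduced here for illustration.

```python
import pandas as pd

# Toy data rebuilt from the question so the snippet is self-contained
df = pd.DataFrame({
    'type':   ['ball'] * 6 + ['chair'] * 5,
    'object': ['soccer', 'soccer', 'basket', 'bouncy', 'tennis', 'tennis',
               'office', 'office', 'office', 'lounge', 'dining'],
    'index':  [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1],
})

N = 3
# Number the unique objects within each type (0, 1, 2, ...), then chunk by N
codes = df.groupby('type')['object'].transform(lambda s: pd.factorize(s)[0])
df['group'] = df['type'].str.cat((codes // N + 1).astype(str), sep='_')

# Unique objects per group, and how many each group is short of N
need = N - df.groupby('group')['object'].nunique()
need = need[need > 0]  # keep only the groups that need padding
print(need.to_dict())
```

The resulting Series is indexed by group name, so it can be used later to look up how many objects each short group still needs.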
Current df (this toy dataset shows only selected columns; the real dataset has more):
     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6   chair  office      1  chair_1
7   chair  office      2  chair_1
8   chair  office      3  chair_1
9   chair  lounge      1  chair_1
10  chair  dining      1  chair_1
... ... ... ......
Desired df (objects have been added to group ball_2):
     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6    ball  soccer      1   ball_2
7    ball  soccer      2   ball_2
8    ball  basket      1   ball_2
9   chair  office      1  chair_1
10  chair  office      2  chair_1
11  chair  office      3  chair_1
12  chair  lounge      1  chair_1
13  chair  dining      1  chair_1
... ... ... ......
You can try this:
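import numpy as np
import pandas as pd

N = 3  # target number of unique objects per group

def add_first_group(x):
    # codes 0..k-1 pick the first k unique objects of the type's first group
    missing = np.arange(N - x['object'].nunique())
    first = df.loc[df['group'].eq(f"{x['type'].iloc[0]}_1")]
    msk = np.isin(first['object'].factorize()[0], missing)
    # relabel the borrowed rows so they belong to the group being filled
    return pd.concat([x, first[msk].assign(group=x['group'].iloc[0])])

# iterate over the groups explicitly rather than through groupby.apply, so the
# result does not depend on how apply treats the grouping column
out = pd.concat(
    [add_first_group(x) if x['object'].nunique() < N else x
     for _, x in df.groupby('group')],
    ignore_index=True,
)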
Output:
     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6    ball  soccer      1   ball_2
7    ball  soccer      2   ball_2
8    ball  basket      1   ball_2
9   chair  office      1  chair_1
10  chair  office      2  chair_1
11  chair  office      3  chair_1
12  chair  lounge      1  chair_1
13  chair  dining      1  chair_1
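For anyone who wants to paste and run something end to end, here is a self-contained sketch of the same idea. It rebuilds the toy frame from the question, and fill_short_groups is a hypothetical helper name (not a pandas function). It assumes that group labels follow the "<type>_<number>" pattern and that each type's first group is "<type>_1". As a design choice of this sketch, a first group that is itself short is left untouched, since it has nothing to borrow from.

```python
import pandas as pd

def fill_short_groups(df, n=3):
    """Pad each group that has fewer than n unique objects with the rows of
    the first unique objects from the same type's first group ('<type>_1')."""
    parts = []
    for name, grp in df.groupby('group', sort=False):
        parts.append(grp)
        k = n - grp['object'].nunique()
        first_name = f"{grp['type'].iloc[0]}_1"
        if k <= 0 or name == first_name:
            continue  # group is already full, or it is the donor group itself
        first = df[df['group'] == first_name]
        # first k unique objects of the donor group, in order of appearance
        donors = first['object'].drop_duplicates().head(k)
        borrowed = first[first['object'].isin(donors)]
        parts.append(borrowed.assign(group=name))  # relabel the borrowed rows
    return pd.concat(parts, ignore_index=True)

# Toy data from the question, with the group column already computed
df = pd.DataFrame({
    'type':   ['ball'] * 6 + ['chair'] * 5,
    'object': ['soccer', 'soccer', 'basket', 'bouncy', 'tennis', 'tennis',
               'office', 'office', 'office', 'lounge', 'dining'],
    'index':  [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1],
    'group':  ['ball_1'] * 4 + ['ball_2'] * 2 + ['chair_1'] * 5,
})

out = fill_short_groups(df, n=3)
print(out)
```

Because the borrowed rows are copied whole, any extra columns in the real dataset travel along with them; only the group label is overwritten.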