Add a certain number of variables from one group to another

Question

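I have a pandas DataFrame in which I split objects of the same type into groups of a fixed size (for example, 3). For example, group ball_1 contains 3 unique objects of the same type: soccer, basket, and bouncy. The remaining objects go into group ball_2, which in this case holds only 1 object, tennis.

For groups that contain fewer than 3 unique objects, I want to fill them with the first k unique objects from the first group. For example, group ball_2 would hold tennis, followed by soccer and basket from group ball_1. The goal is for all groups to have the same number of unique objects.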

import pandas as pd

# chunk into groups of 3
N = 3
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), sep='_')

# identify which groups need more objects
# (group by the column name, not a one-element list, so `name` is a plain string)
for name, batch in df.groupby('group'):
    n_unique = batch['object'].nunique()
    if n_unique >= N:
        continue
    print('{} needs {} more objects'.format(name, N - n_unique))
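
The per-group check can also be computed without an explicit loop. Below is a minimal sketch, assuming a toy frame that mirrors the sample rows shown in the question; the names `toy`, `counts` and `shortfall` are illustrative and not part of the original code.

```python
import pandas as pd

# Toy data mirroring the sample rows in the question (assumed, not the real dataset)
toy = pd.DataFrame({
    'type':   ['ball'] * 6 + ['chair'] * 5,
    'object': ['soccer', 'soccer', 'basket', 'bouncy', 'tennis', 'tennis',
               'office', 'office', 'office', 'lounge', 'dining'],
})

N = 3  # unique objects per group
# Number each unique object within its type, then chunk the numbers into groups of N
g = toy.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
toy['group'] = toy['type'] + '_' + g.astype(str)

# Unique objects per group, and how many each incomplete group is short of N
counts = toy.groupby('group')['object'].nunique()
shortfall = (N - counts)[counts < N]
print(shortfall.to_dict())
```

Here `shortfall` is a Series indexed by group label, so it can be used directly to decide how many objects to borrow for each incomplete group.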

Current df (this toy dataset shows only selected columns; the real dataset has more columns)

     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6   chair  office      1  chair_1
7   chair  office      2  chair_1
8   chair  office      3  chair_1
9   chair  lounge      1  chair_1
10  chair  dining      1  chair_1
... ...    ...         ......

Desired df (objects have been added to group ball_2)

     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6    ball  soccer      1   ball_2
7    ball  soccer      2   ball_2
8    ball  basket      1   ball_2
9   chair  office      1  chair_1
10  chair  office      2  chair_1
11  chair  office      3  chair_1
12  chair  lounge      1  chair_1
13  chair  dining      1  chair_1
... ...    ...         ......

Answer 1

You can try this:

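import numpy as np
import pandas as pd

N = 3  # target number of unique objects per group

def add_first_group(x):
    # factorize codes 0..k-1 select the first k unique objects of the type's first group
    missing = np.arange(N - x['object'].nunique())
    first = df.loc[df['group'].eq(f"{x['type'].iloc[0]}_1")]
    msk = np.isin(first['object'].factorize()[0], missing)
    return pd.concat([x, first[msk]])


# wrap the chained calls in parentheses so they can span several lines
temp = (
    df.groupby('group')
      .apply(lambda x: add_first_group(x) if x['object'].nunique() < N else x)
      .drop(columns='group', errors='ignore')
)

# recover each row's group label from the first level of the resulting index
groups = temp.index.get_level_values(0).to_frame().reset_index(drop=True)

pd.concat([temp.reset_index(drop=True), groups], axis=1)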

Output:

     type  object  index    group
0    ball  soccer      1   ball_1
1    ball  soccer      2   ball_1
2    ball  basket      1   ball_1
3    ball  bouncy      1   ball_1
4    ball  tennis      1   ball_2
5    ball  tennis      2   ball_2
6    ball  soccer      1   ball_2
7    ball  soccer      2   ball_2
8    ball  basket      1   ball_2
9   chair  office      1  chair_1
10  chair  office      2  chair_1
11  chair  office      3  chair_1
12  chair  lounge      1  chair_1
13  chair  dining      1  chair_1
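
For comparison, the same fill can be sketched without `groupby.apply`, which avoids depending on how a given pandas version treats the grouping column inside `apply`. This is only a sketch under stated assumptions: the toy frame mirrors the sample rows in the question, the helper name `fill_groups` is hypothetical, and each type's first group is assumed to be labelled `<type>_1`. A first group that is itself short is left as is, since there is nothing to borrow from.

```python
import pandas as pd

def fill_groups(df, n=3):
    """Pad groups with fewer than n unique objects using rows for the first
    unique objects of the same type's first group (hypothetical helper)."""
    pieces = []
    for name, batch in df.groupby('group', sort=False):
        pieces.append(batch)
        need = n - batch['object'].nunique()
        first_name = batch['type'].iloc[0] + '_1'
        if need <= 0 or name == first_name:
            continue  # group is complete, or it is the donor group itself
        first = df[df['group'] == first_name]
        # first `need` unique objects of the donor group, in order of appearance
        donors = first['object'].drop_duplicates().head(need)
        # copy every row of those objects and relabel it with the short group
        pieces.append(first[first['object'].isin(donors)].assign(group=name))
    return pd.concat(pieces, ignore_index=True)

# Toy data mirroring the question's current df (assumed, not the real dataset)
toy = pd.DataFrame({
    'type':   ['ball'] * 6 + ['chair'] * 5,
    'object': ['soccer', 'soccer', 'basket', 'bouncy', 'tennis', 'tennis',
               'office', 'office', 'office', 'lounge', 'dining'],
    'index':  [1, 2, 1, 1, 1, 2, 1, 2, 3, 1, 1],
    'group':  ['ball_1'] * 4 + ['ball_2'] * 2 + ['chair_1'] * 5,
})

result = fill_groups(toy)
print(result)
```

The loop keeps the original row order within each group and appends the borrowed rows after them, which matches the layout of the desired df in the question.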

python

dataframe

pandas

data-wrangling