Getting data from a different dataframe

I have a dataframe:

Name    Subset    Type    System
A00     IU00-A    OP      A
A00     IT00      PP      A
B01     IT-01A    PP      B
B01     IU        OP      B
B03     IM-09-B   LP      A
B03     IM03A     OP      A
B03     IT-09     OP      A
D09     IT        OP      A
D09     IM        LP      A
D09     IM        OP      A

I have converted it to:

Subset Cluster    Type Cluster    Name          System
IU,IT             OP,PP           A00           A
IM,IM,IT          LP, OP, OP      B03, D09      A
IU,IT             OP,PP           B01           B

using:

out = df.assign(Subset=df['Subset'].str[:2])\
        .sort_values(by=df.columns.tolist())\
        .groupby('Name', as_index=False)\
        .agg(**{'Subset Cluster': ('Subset', ', '.join), 
                'Type Cluster': ('Type', ', '.join), 
                'System': ('System', 'first')})\
        .groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)\
        .agg(', '.join)

To this transformed dataframe I need to add another column that lists all the subsets belonging to the name(s) in each row.

Example output:

Subset Cluster    Type Cluster    Name          System    Subsets
IU,IT             OP,PP           A00           A         IU00-A,IT00
IM,IM,IT          LP, OP, OP      B03, D09      A         IM-09-B,IM03A,IT-09,IT,IM,IM   
IU,IT             OP,PP           B01           B         IT-01A,IU

Using:

import pandas as pd

s = """Name    Subset    Type    System
A00     IU00-A    OP      A
A00     IT00      PP      A
B01     IT-01A    PP      B
B01     IU        OP      B
B03     IM-09-B   LP      A
B03     IM03A     OP      A
B03     IT-09     OP      A
D09     IT        OP      A
D09     IM        LP      A
D09     IM        OP      A"""

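# Parse the whitespace-separated text into a header row and data rows.
temp = [x.split() for x in s.split('\n')]
cols = temp[0]
data = temp[1:]
df = pd.DataFrame(data, columns=cols)

# Stand-in for the transformed dataframe; only the 'Name' column is needed here.
df1 = pd.DataFrame({'Name': ['A00', 'B03, D09', 'B01']})

# For each (possibly combined) name, collect the subsets of all matching rows.
vals = []
for val in df1['Name']:
    # 'B03, D09' -> 'B03|D09', a regex alternation for str.contains
    pattern = val.replace(', ', '|')
    vals.append(df[df['Name'].str.contains(pattern)]['Subset'].values)

df1['Subsets'] = vals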

Output:

We can assign Subset Cluster first, then use a double groupby:

out = df.assign(**{'Subset Cluster': df['Subset'].str[:2]})\
        .sort_values(by=df.columns.tolist())\
        .groupby(['Name', 'System'], as_index=False)\
        .agg(', '.join).rename(columns={'Type':'Type Cluster'})\
        .groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)\
        .agg(', '.join)

Output:

  Subset Cluster Type Cluster System      Name                             Subset
0     IM, IM, IT   LP, OP, OP      A  B03, D09  IM-09-B, IM03A, IT-09, IM, IM, IT
1         IT, IU       PP, OP      A       A00                       IT00, IU00-A
2         IT, IU       PP, OP      B       B01                         IT-01A, IU
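For anyone who wants to run the double groupby end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and then applies the same chain. The rename of `Subset` to `Subsets` and the final column reordering are my own additions to match the layout requested in the question. They are not part of the snippet above, and the `rows` list is just a convenient way to type the data in.

```python
import pandas as pd

# Sample data copied from the question (Name, Subset, Type, System).
rows = [
    ("A00", "IU00-A", "OP", "A"),
    ("A00", "IT00", "PP", "A"),
    ("B01", "IT-01A", "PP", "B"),
    ("B01", "IU", "OP", "B"),
    ("B03", "IM-09-B", "LP", "A"),
    ("B03", "IM03A", "OP", "A"),
    ("B03", "IT-09", "OP", "A"),
    ("D09", "IT", "OP", "A"),
    ("D09", "IM", "LP", "A"),
    ("D09", "IM", "OP", "A"),
]
df = pd.DataFrame(rows, columns=["Name", "Subset", "Type", "System"])

out = (
    # Keep the full Subset and add its two-letter prefix as a new column.
    df.assign(**{"Subset Cluster": df["Subset"].str[:2]})
      .sort_values(by=df.columns.tolist())
      # First pass: one row per Name, joining every string column.
      .groupby(["Name", "System"], as_index=False)
      .agg(", ".join)
      # Assumption: 'Subsets' is the column name wanted in the question.
      .rename(columns={"Type": "Type Cluster", "Subset": "Subsets"})
      # Second pass: merge names that share the same clusters and system.
      .groupby(["Subset Cluster", "Type Cluster", "System"], as_index=False)
      .agg(", ".join)
)

# Reorder the columns to follow the layout shown in the question.
out = out[["Subset Cluster", "Type Cluster", "Name", "System", "Subsets"]]
print(out)
```

Because the rows are sorted before grouping, the order inside each joined string follows the sort, not the order in the original dataframe, so it will not match the hand-written example in the question character for character.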