Getting data from a different dataframe
I have a dataframe:
Name Subset Type System
A00 IU00-A OP A
A00 IT00 PP A
B01 IT-01A PP B
B01 IU OP B
B03 IM-09-B LP A
B03 IM03A OP A
B03 IT-09 OP A
D09 IT OP A
D09 IM LP A
D09 IM OP A
I have converted it to

Subset Cluster  Type Cluster  Name      System
IU,IT           OP,PP         A00       A
IM,IM,IT        LP, OP, OP    B03, D09  A
IU,IT           OP,PP         B01       B

using
out = df.assign(Subset=df['Subset'].str[:2])\
        .sort_values(by=df.columns.tolist())\
        .groupby('Name', as_index=False)\
        .agg(**{'Subset Cluster': ('Subset', ', '.join),
                'Type Cluster': ('Type', ', '.join),
                'System': ('System', 'first')})\
        .groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)\
        .agg(', '.join)
In this transformed dataframe, I need to add another column that lists all the subsets belonging to the names in each row.
Example output:
Subset Cluster  Type Cluster  Name      System  Subsets
IU,IT           OP,PP         A00       A       IU00-A,IT00
IM,IM,IT        LP, OP, OP    B03, D09  A       IM-09-B,IM03A,IT-09,IT,IM,IM
IU,IT           OP,PP         B01       B       IT-01A,IU
Use:
import pandas as pd

s = """Name Subset Type System
A00 IU00-A OP A
A00 IT00 PP A
B01 IT-01A PP B
B01 IU OP B
B03 IM-09-B LP A
B03 IM03A OP A
B03 IT-09 OP A
D09 IT OP A
D09 IM LP A
D09 IM OP A"""

# Parse the whitespace-separated text into a header row and data rows
temp = [x.split() for x in s.split('\n')]
cols = temp[0]
data = temp[1:]
df = pd.DataFrame(data, columns=cols)

# Names as they appear in the transformed dataframe
df1 = pd.DataFrame({'Name': ['A00', 'B03, D09', 'B01']})

vals = []
for val in df1['Name']:
    # 'B03, D09' becomes the regex alternation 'B03|D09'
    t = val.replace(', ', '|')
    vals.append(df[df['Name'].str.contains(t)]['Subset'].values)
df1['Subsets'] = vals
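
The loop above stores a NumPy array in each cell and matches names by regex substring. As a hedged variant, the sketch below splits each combined name and uses an exact `isin` match, then joins the hits into one string to resemble the requested output. The helper name `subsets_for` is hypothetical and not part of the original answer.

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'Name':   ['A00', 'A00', 'B01', 'B01', 'B03', 'B03', 'B03', 'D09', 'D09', 'D09'],
    'Subset': ['IU00-A', 'IT00', 'IT-01A', 'IU', 'IM-09-B', 'IM03A', 'IT-09', 'IT', 'IM', 'IM'],
    'Type':   ['OP', 'PP', 'PP', 'OP', 'LP', 'OP', 'OP', 'OP', 'LP', 'OP'],
    'System': ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A'],
})

# Names as they appear in the transformed dataframe
df1 = pd.DataFrame({'Name': ['A00', 'B03, D09', 'B01']})

def subsets_for(names):
    """Hypothetical helper: join the Subset values of every name in 'names'."""
    wanted = names.split(', ')  # 'B03, D09' -> ['B03', 'D09']
    # isin matches whole names only, so 'A0' would not match 'A00'
    return ', '.join(df.loc[df['Name'].isin(wanted), 'Subset'])

df1['Subsets'] = df1['Name'].map(subsets_for)
print(df1)
```

The subsets come out in the row order of `df`, since no sorting is applied here.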
We can first assign Subset Cluster, then use a double groupby:
out = df.assign(**{'Subset Cluster': df['Subset'].str[:2]})\
        .sort_values(by=df.columns.tolist())\
        .groupby(['Name', 'System'], as_index=False)\
        .agg(', '.join).rename(columns={'Type':'Type Cluster'})\
        .groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)\
        .agg(', '.join)
Output:
   Subset Cluster  Type Cluster  System  Name      Subset
0  IM, IM, IT      LP, OP, OP    A       B03, D09  IM-09-B, IM03A, IT-09, IM, IM, IT
1  IT, IU          PP, OP        A       A00       IT00, IU00-A
2  IT, IU          PP, OP        B       B01       IT-01A, IU
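
For anyone who wants to run the double groupby end to end, here is a self-contained sketch. It builds the sample frame inline and uses named aggregation so that the final column is called Subsets, as the question asked. The intermediate name `per_name` and the `join` alias are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name':   ['A00', 'A00', 'B01', 'B01', 'B03', 'B03', 'B03', 'D09', 'D09', 'D09'],
    'Subset': ['IU00-A', 'IT00', 'IT-01A', 'IU', 'IM-09-B', 'IM03A', 'IT-09', 'IT', 'IM', 'IM'],
    'Type':   ['OP', 'PP', 'PP', 'OP', 'LP', 'OP', 'OP', 'OP', 'LP', 'OP'],
    'System': ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A'],
})

join = ', '.join  # alias for the string-joining aggregation

# First groupby: one row per (Name, System), with joined clusters and full subsets
per_name = (
    df.assign(**{'Subset Cluster': df['Subset'].str[:2]})
      .sort_values(by=['Name', 'Subset', 'Type', 'System'])
      .groupby(['Name', 'System'], as_index=False)
      .agg(**{
          'Subset Cluster': ('Subset Cluster', join),
          'Type Cluster': ('Type', join),
          'Subsets': ('Subset', join),
      })
)

# Second groupby: merge names that share the same cluster signature
out = (
    per_name
    .groupby(['Subset Cluster', 'Type Cluster', 'System'], as_index=False)
    .agg(Name=('Name', join), Subsets=('Subsets', join))
)
print(out)
```

Because the rows are sorted before grouping, the subsets within each name appear in sorted order rather than in the original row order.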