Concatenate different values that belong to the same group
I have a DataFrame like this:
import pandas as pd

data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30, 'AA'],
    ['ACOT', '00002', '', '', 1.7, 20, 33, 'BB'],
    ['ACOT', '00003', '', 'NA_0001', 1.4, 20, 40, 'AA'],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30, 'DD'],
    ['PAN', '000091', '', '', 0.4, 10, 30, 'CC'],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15, 'EE'],
    ['TOM', '000040', '', '', 1.7, 10, 300, 'EE'],
]
df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics',
    'start', 'end', 'Example_extra_col'])
  Gene_name Transcript_ID  canonical     mane  metrics  start  end Example_extra_col
0      ACOT         00001                          1.5     20   30                AA
1      ACOT         00002                          1.7     20   33                BB
2      ACOT         00003             NA_0001      1.4     20   40                AA
3       PAN        000090  canonical               0.5     10   30                DD
4       PAN        000091                          0.4     10   30                CC
5       TOM        000080  canonical               0.4     10   15                EE
6       TOM        000040                          1.7     10  300                EE
I am collapsing the rows to one per gene while trying not to lose the information those rows carry:
out = (
    df.groupby('Gene_name', as_index=False)
    .agg({
        'canonical': 'any',
        'mane': 'any',
        'metrics': lambda x: f'{x.min()}-{x.max()}',
        'Example_extra_col': 'first',  # Here is the one I want to change
    })
    .replace({True: 'Yes', False: 'No'})
)
However, for the last column, I want to concatenate the values when they differ within the same group:
  Gene_name canonical mane  metrics Example_extra_col
0      ACOT        No  Yes  1.4-1.7             AA,BB
1       PAN       Yes   No  0.4-0.5             DD,CC
2       TOM       Yes   No  0.4-1.7                EE
How can I do this with .agg?
Try this:
# Custom aggregation function: join the unique, non-empty values of a group
func = lambda x: ", ".join([y for y in x.fillna("").unique() if y])

(
    df.groupby("Gene_name", as_index=False)
    .agg(
        {
            "canonical": "any",
            "mane": "any",
            "metrics": lambda x: f"{x.min()}-{x.max()}",
            "Example_extra_col": func,
        }
    )
    .replace({True: "Yes", False: "No"})
)
(Edited to support NaN values in the column.)
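
To check the custom aggregation in isolation, here is a minimal, self-contained sketch of the same idea. It is not the exact code from the answer: `join_unique` is a hypothetical helper name, it uses `dropna()` rather than `fillna("")`, and it joins with "," (no space) to match the expected output in the question. The last row, which has a missing value, is not in the original data and is there only to exercise the NaN handling.

```python
import pandas as pd

# Two columns from the question's data, plus one hypothetical row with a
# missing value (not in the original data) to exercise the NaN handling.
df = pd.DataFrame(
    {
        "Gene_name": ["ACOT", "ACOT", "ACOT", "PAN", "PAN", "TOM", "TOM", "TOM"],
        "Example_extra_col": ["AA", "BB", "AA", "DD", "CC", "EE", "EE", None],
    }
)


def join_unique(values: pd.Series, sep: str = ",") -> str:
    """Join the distinct, non-missing, non-empty values of one group.

    Hypothetical helper: values keep their order of first appearance.
    """
    return sep.join(v for v in values.dropna().unique() if v)


# Named aggregation: output column = (input column, aggregation function).
out = df.groupby("Gene_name", as_index=False).agg(
    Example_extra_col=("Example_extra_col", join_unique)
)
print(out)
```

With this data the three groups come out as AA,BB for ACOT, DD,CC for PAN, and EE for TOM. The same function can be passed as the value for "Example_extra_col" in the dictionary given to .agg, in place of func.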