Collapse rows that share a value in one column and summarise the remaining columns in different ways, keeping the number of columns
I have this DataFrame:
import pandas as pd
data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30, 'col1ACOT'],
    ['ACOT', '00002', '', '', 1.7, 20, 33, 'col1ACOT'],
    ['ACOT', '00003', '', 'NA_0001', 1.4, 20, 40, 'col1ACOT'],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30, 'col1PAN'],
    ['PAN', '000091', '', '', 0.4, 10, 30, 'col1PAN'],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15, 'col1TOM'],
    ['TOM', '000040', '', '', 1.7, 10, 300, 'col1TOM']
]
df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane',
    'metrics', 'start', 'end', 'Example_extra_col'])
   Gene_name  Transcript_ID  canonical  mane      metrics  start  end  Example_extra_col
0  ACOT       00001                                1.5      20     30   col1ACOT
1  ACOT       00002                     NA_0001   1.7      20     33   col1ACOT
2  ACOT       00003                                1.4      20     40   col1ACOT
3  PAN        000090         canonical  NA_00090  0.5      10     30   col1PAN
4  PAN        000091                              0.4      10     30   col1PAN
5  TOM        000080         canonical            0.4      10     15   col1TOM
6  TOM        000040                              1.7      10     300  col1TOM
I want this output:
   Gene_name  canonical  mane  metrics  Example_extra_col
0  ACOT       No         Yes   1.4-1.5  col1ACOT
4  PAN        Yes        Yes   0.5-0.4  col1PAN
5  TOM        Yes        No    1.7-0.4  col1TOM
I can get partway there with these lines:
f = lambda x: "Yes" if x.any() else "No" # For canonical and mane
df = df.groupby('Gene_name').agg({'canonical': f, 'mane': f, 'metrics': ['min', 'max']})
           canonical  mane  metrics_min  metrics_max
Gene_name
ACOT       No         Yes   1.4          1.7
PAN        Yes        Yes   0.4          0.5
TOM        Yes        No    0.4          1.7
But I lose Example_extra_col (and the columns after it), and my real DataFrame has many more columns.
How can I do this?
You can do all of this with GroupBy.agg:
out = (df
    .groupby('Gene_name', as_index=False)
    .agg({'canonical': 'any',
          'mane': 'any',
          'metrics': lambda x: f'{x.min()}-{x.max()}',
          'Example_extra_col': 'first',
         })
    .replace({True: 'Yes', False: 'No'})
)
Output:
   Gene_name  canonical  mane  metrics  Example_extra_col
0  ACOT       No         Yes   1.4-1.7  col1ACOT
1  PAN        Yes        No    0.4-0.5  col1PAN
2  TOM        Yes        No    0.4-1.7  col1TOM
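The 'any' aggregation above works on the canonical and mane columns because empty strings are falsy and non-empty strings are truthy. A minimal sketch that makes this explicit, assuming that a missing annotation is always stored as an empty string: the two columns are first turned into boolean flags with .ne(''), and the Yes/No labels are then applied per column with Series.map. The name flags is illustrative only.

```python
import pandas as pd

data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30, 'col1ACOT'],
    ['ACOT', '00002', '', '', 1.7, 20, 33, 'col1ACOT'],
    ['ACOT', '00003', '', 'NA_0001', 1.4, 20, 40, 'col1ACOT'],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30, 'col1PAN'],
    ['PAN', '000091', '', '', 0.4, 10, 30, 'col1PAN'],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15, 'col1TOM'],
    ['TOM', '000040', '', '', 1.7, 10, 300, 'col1TOM'],
]
df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane',
    'metrics', 'start', 'end', 'Example_extra_col'])

# Assumption: "no annotation" is always the empty string ''.
# Turn the two annotation columns into explicit boolean flags.
flags = df.assign(
    canonical=df['canonical'].ne(''),
    mane=df['mane'].ne(''),
)

# 'any' now runs on real boolean columns: True if any row in the group is flagged.
out = (flags
    .groupby('Gene_name', as_index=False)
    .agg({'canonical': 'any', 'mane': 'any'})
)

# Label the flags column by column, leaving every other column untouched.
for col in ('canonical', 'mane'):
    out[col] = out[col].map({True: 'Yes', False: 'No'})

print(out)
```

Mapping only the two flag columns keeps the Yes/No substitution from touching any other column that might happen to hold boolean values.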
If you need custom output column names, use named aggregation instead:
(df
    .groupby('Gene_name')
    .agg(**{'canonical': ('canonical', 'any'),
            'mane': ('mane', 'any'),
            'metrics': ('metrics', lambda x: f'{x.min()}-{x.max()}'),
            'Example_extra_col': ('Example_extra_col', 'first')
           })
)
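For the case raised in the question, where the real DataFrame has many more columns, the aggregation dictionary can be built from df.columns rather than typed out by hand. A minimal sketch, assuming that every column without an explicit rule should keep its first value per gene, and that Transcript_ID, start and end are meant to be dropped (as in the desired output). The names special, drop, yes_no and agg_spec are illustrative and not part of pandas.

```python
import pandas as pd

data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30, 'col1ACOT'],
    ['ACOT', '00002', '', '', 1.7, 20, 33, 'col1ACOT'],
    ['ACOT', '00003', '', 'NA_0001', 1.4, 20, 40, 'col1ACOT'],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30, 'col1PAN'],
    ['PAN', '000091', '', '', 0.4, 10, 30, 'col1PAN'],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15, 'col1TOM'],
    ['TOM', '000040', '', '', 1.7, 10, 300, 'col1TOM'],
]
df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane',
    'metrics', 'start', 'end', 'Example_extra_col'])


def yes_no(s):
    """Return 'Yes' if any value in the group is a non-empty string, else 'No'."""
    return 'Yes' if s.ne('').any() else 'No'


# Columns that need a specific aggregation.
special = {
    'canonical': yes_no,
    'mane': yes_no,
    'metrics': lambda x: f'{x.min()}-{x.max()}',
}

# Assumption: these columns are not wanted in the summary.
drop = ['Transcript_ID', 'start', 'end']

# Every remaining column falls back to 'first'; the order follows df.columns.
agg_spec = {
    col: special.get(col, 'first')
    for col in df.columns
    if col != 'Gene_name' and col not in drop
}

out = df.groupby('Gene_name', as_index=False).agg(agg_spec)
print(out)
```

With this layout, a new column in the real data is carried through with 'first' automatically, and only columns that need a different summary have to be added to special.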