Concatenate different values that belong to the same group

I have a DataFrame like this:


import pandas as pd


data = [
    ['ACOT', '00001', '', '', 1.5, 20, 30, 'AA'],
    ['ACOT', '00002', '', '', 1.7, 20, 33, 'BB'],
    ['ACOT', '00003', '', 'NA_0001', 1.4, 20, 40, 'AA'],
    ['PAN', '000090', 'canonical', '', 0.5, 10, 30, 'DD'],
    ['PAN', '000091', '', '', 0.4, 10, 30, 'CC'],
    ['TOM', '000080', 'canonical', '', 0.4, 10, 15, 'EE'],
    ['TOM', '000040', '', '', 1.7, 10, 300, 'EE']
]

df = pd.DataFrame(data, columns=[
    'Gene_name', 'Transcript_ID', 'canonical', 'mane', 'metrics','start','end', 'Example_extra_col'])



  Gene_name Transcript_ID  canonical     mane  metrics  start  end Example_extra_col
0      ACOT         00001                          1.5     20   30                AA
1      ACOT         00002                          1.7     20   33                BB
2      ACOT         00003             NA_0001      1.4     20   40                AA
3       PAN        000090  canonical               0.5     10   30                DD
4       PAN        000091                          0.4     10   30                CC
5       TOM        000080  canonical               0.4     10   15                EE
6       TOM        000040                          1.7     10  300                EE

I am collapsing the rows to one per gene while trying not to lose the data those rows contain:

out = (df
 .groupby('Gene_name', as_index=False)
 .agg({'canonical': 'any',
       'mane': 'any',
       'metrics': lambda x: f'{x.min()}-{x.max()}',
       'Example_extra_col': 'first',  # Here is the one I want to change
      })
 .replace({True: 'Yes', False: 'No'})
)

However, for the last column, I want to concatenate the values when they differ within the same group:

  Gene_name canonical mane  metrics Example_extra_col
0      ACOT        No  Yes  1.4-1.7           AA,BB
1       PAN       Yes   No  0.4-0.5           DD,CC
2       TOM       Yes   No  0.4-1.7           EE

How can I do this with .agg?

Try this:

# Custom aggregation function
func = lambda x: ", ".join([y for y in x.fillna("").unique() if y])

(
    df.groupby("Gene_name", as_index=False)
    .agg(
        {
            "canonical": "any",
            "mane": "any",
            "metrics": lambda x: f"{x.min()}-{x.max()}",
            "Example_extra_col": func,
        }
    )
    .replace({True: "Yes", False: "No"})
)

(Edited to support NaN values in the column.)
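To see how the NaN handling plays out, here is a minimal sketch of the same idea written as a named function. The name `join_unique`, its `sep` parameter, and the small `demo` frame (with one value swapped for NaN) are made up for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd


def join_unique(values: pd.Series, sep: str = ", ") -> str:
    """Join the distinct non-empty values of one group, in order of first appearance."""
    # fillna("") turns NaN into an empty string, which the filter below drops
    return sep.join(v for v in values.fillna("").unique() if v)


# Hypothetical demo data: the second PAN entry is NaN instead of 'CC'
demo = pd.DataFrame({
    "Gene_name": ["ACOT", "ACOT", "ACOT", "PAN", "PAN", "TOM", "TOM"],
    "Example_extra_col": ["AA", "BB", "AA", "DD", np.nan, "EE", "EE"],
})

result = demo.groupby("Gene_name")["Example_extra_col"].agg(join_unique)
print(result.to_dict())
# {'ACOT': 'AA, BB', 'PAN': 'DD', 'TOM': 'EE'}
```

To get the exact `AA,BB` format from the question, without the space, something like `lambda x: join_unique(x, sep=",")` could be passed to `.agg` instead.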