Merge two pandas dataframes and put the matched entries in one column, separated by a pipe

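I have two large pandas dataframes, variants and phenotype. When the two are matched on the GENE column, the result should have a new HP-ID column that holds all matching IDs separated by a pipe. Here are a few rows of each dataframe: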

import pandas as pd

# variants

data_var = {'CHROM': ['Chr1', 'Chr11'], 'START': [51937273, 56867846], 'GENE': ['KCNJ1', 'NPHS2'], 'REF': ['C', 'G'], 'ALT': ['T', 'A']}

variants = pd.DataFrame(data_var)

   CHROM     START   GENE REF ALT
0   Chr1  51937273  KCNJ1   C   T
1  Chr11  56867846  NPHS2   G   A

# phenotype

data_phe = {'entrez-id': [3758, 3758, 3758, 3758, 3758, 7827, 7827, 7827, 7827],
            'GENE': ['KCNJ1', 'KCNJ1', 'KCNJ1', 'KCNJ1', 'KCNJ1', 'NPHS2', 'NPHS2', 'NPHS2', 'NPHS2'],
            'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0001561', 'HP:0000256', 'HP:0001508', 'HP:0003774', 'HP:0003678', 'HP:0000093', 'HP:0003073'],
            'phenotype': ['Vomiting', 'Frontal bossing', 'Polyhydramnios', 'Macrocephaly', 'Failure to thrive', 'Stage 5 chronic kidney disease', 'Rapidly progressive', 'Proteinuria', 'Hypoalbuminemia']}


phenotype = pd.DataFrame(data_phe)

   entrez-id   GENE       HP-ID                       phenotype
0       3758  KCNJ1  HP:0002013                        Vomiting
1       3758  KCNJ1  HP:0002007                 Frontal bossing
2       3758  KCNJ1  HP:0001561                  Polyhydramnios
3       3758  KCNJ1  HP:0000256                    Macrocephaly
4       3758  KCNJ1  HP:0001508               Failure to thrive
5       7827  NPHS2  HP:0003774  Stage 5 chronic kidney disease
6       7827  NPHS2  HP:0003678             Rapidly progressive
7       7827  NPHS2  HP:0000093                     Proteinuria
8       7827  NPHS2  HP:0003073                 Hypoalbuminemia

Expected output

CHROM  START  GENE  REF  ALT  HP-ID
Chr1  51937273  KCNJ1  C  T  HP:0002013|HP:0002007|HP:0001561|HP:0000256|HP:0001508
Chr11  56867846  NPHS2  G  A  HP:0003774|HP:0003678|HP:0000093|HP:0003073

What I tried

from functools import reduce

data_frames = [variants, phenotype]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['GENE'], how='outer'), data_frames)

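This outputs a separate row for every matching variant and phenotype pair, instead of one row per variant with the HP-IDs combined.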
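To see why the row count grows, here is a small sketch on a trimmed subset of the sample data (two HP-IDs for KCNJ1, one for NPHS2). The names `variants_small`, `phenotype_small` and `merged` are made up for this illustration. A merge on GENE pairs each variant with every phenotype row of the same gene, so a gene with two phenotype rows yields two output rows.

```python
import pandas as pd

# Trimmed subset of the question's data, for illustration only.
variants_small = pd.DataFrame({
    'CHROM': ['Chr1', 'Chr11'],
    'START': [51937273, 56867846],
    'GENE': ['KCNJ1', 'NPHS2'],
})
phenotype_small = pd.DataFrame({
    'GENE': ['KCNJ1', 'KCNJ1', 'NPHS2'],
    'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0003774'],
})

# Each variant row is repeated once per matching phenotype row.
merged = pd.merge(variants_small, phenotype_small, on=['GENE'], how='outer')

print(len(variants_small), len(merged))  # 2 variants become 3 merged rows
print(merged)
```

So the merge itself is fine; what is missing is a step that collapses the phenotype rows to one row per gene before merging.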

First aggregate the HP-ID values per gene with join in GroupBy.agg, then use DataFrame.merge:

variants.merge(phenotype.groupby('GENE')['HP-ID'].agg('|'.join).reset_index(), on='GENE')
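For reference, the same approach as a self-contained sketch using the sample data from the question (only the GENE and HP-ID columns of phenotype are needed). The names `hp_ids` and `result` are introduced here for illustration. The `how='left'` argument is an assumption on my part: it keeps variants whose gene has no phenotype entry (with NaN in HP-ID), whereas the default inner merge would drop them.

```python
import pandas as pd

variants = pd.DataFrame({
    'CHROM': ['Chr1', 'Chr11'],
    'START': [51937273, 56867846],
    'GENE': ['KCNJ1', 'NPHS2'],
    'REF': ['C', 'G'],
    'ALT': ['T', 'A'],
})

phenotype = pd.DataFrame({
    'GENE': ['KCNJ1'] * 5 + ['NPHS2'] * 4,
    'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0001561', 'HP:0000256',
              'HP:0001508', 'HP:0003774', 'HP:0003678', 'HP:0000093',
              'HP:0003073'],
})

# Collapse phenotype to one row per gene, joining the HP-IDs with a pipe.
# groupby keeps the original row order within each group.
hp_ids = phenotype.groupby('GENE')['HP-ID'].agg('|'.join).reset_index()

# Attach the combined HP-ID string to each variant.
# Assumption: how='left' keeps variants without any phenotype match.
result = variants.merge(hp_ids, on='GENE', how='left')

print(result.to_string(index=False))
```

If some HP-ID values can be missing, `'|'.join` will raise a TypeError on NaN, so dropping them first with `phenotype.dropna(subset=['HP-ID'])` may be needed.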