Merge two pandas dataframes and enter the matched entries in a column separated by a pipe
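I have 2 large pandas dataframes, `variants` and `phenotype`. When the dataframes are mapped on the `GENE` column, the result should have a new column, `HP-ID`, holding the matched entries separated by a pipe. Here are a few rows of the dataframes: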
import pandas as pd
# variants
data_var = {'CHROM': ['Chr1', 'Chr11'], 'START': [51937273, 56867846], 'GENE': ['KCNJ1', 'NPHS2'], 'REF': ['C', 'G'], 'ALT': ['T', 'A']}
variants = pd.DataFrame(data_var)
CHROM START GENE REF ALT
0 Chr1 51937273 KCNJ1 C T
1 Chr11 56867846 NPHS2 G A
# phenotype
data_phe = {'entrez-id': [3758, 3758, 3758, 3758, 3758, 7827, 7827, 7827, 7827],
'GENE': ['KCNJ1', 'KCNJ1', 'KCNJ1', 'KCNJ1', 'KCNJ1', 'NPHS2', 'NPHS2', 'NPHS2', 'NPHS2'],
'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0001561', 'HP:0000256', 'HP:0001508', 'HP:0003774', 'HP:0003678', 'HP:0000093', 'HP:0003073'],
'phenotype': ['Vomiting', 'Frontal bossing', 'Polyhydramnios', 'Macrocephaly', 'Failure to thrive', 'Stage 5 chronic kidney disease', 'Rapidly progressive', 'Proteinuria', 'Hypoalbuminemia']}
phenotype = pd.DataFrame(data_phe)
entrez-id GENE HP-ID phenotype
0 3758 KCNJ1 HP:0002013 Vomiting
1 3758 KCNJ1 HP:0002007 Frontal bossing
2 3758 KCNJ1 HP:0001561 Polyhydramnios
3 3758 KCNJ1 HP:0000256 Macrocephaly
4 3758 KCNJ1 HP:0001508 Failure to thrive
5 7827 NPHS2 HP:0003774 Stage 5 chronic kidney disease
6 7827 NPHS2 HP:0003678 Rapidly progressive
7 7827 NPHS2 HP:0000093 Proteinuria
8 7827 NPHS2 HP:0003073 Hypoalbuminemia
Expected output:
CHROM START GENE REF ALT HP-ID
Chr1 51937273 KCNJ1 C T HP:0002013|HP:0002007|HP:0001561|HP:0000256|HP:0001508
Chr11 56867846 NPHS2 G A HP:0003774|HP:0003678|HP:0000093|HP:0003073
What I tried:
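from functools import reduce
data_frames = [variants, phenotype]
# An outer merge on GENE yields one output row per matching phenotype row
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['GENE'], how='outer'), data_frames)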
This prints every variant row paired with every phenotype row that matches it, rather than one row per variant.
First aggregate the `HP-ID` values per gene with `join` in `GroupBy.agg`, and then use `DataFrame.merge`:
variants.merge(phenotype.groupby('GENE')['HP-ID'].agg('|'.join).reset_index(), on='GENE')
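As a self-contained sketch of the same idea, the snippet below wraps the aggregate-then-merge step in a helper. The function name `add_hp_ids` and the choice of `how='left'` are assumptions, not part of the question: a left merge keeps variants whose gene has no phenotype entry (they get `NaN` in `HP-ID`), whereas the default inner merge would drop them. The phenotype sample is deliberately trimmed to the first three KCNJ1 rows from the question so that NPHS2 has no match here.

```python
import pandas as pd

# Variant rows taken from the question.
variants = pd.DataFrame({
    'CHROM': ['Chr1', 'Chr11'],
    'START': [51937273, 56867846],
    'GENE': ['KCNJ1', 'NPHS2'],
    'REF': ['C', 'G'],
    'ALT': ['T', 'A'],
})

# Phenotype rows trimmed to three KCNJ1 entries (illustration only),
# so NPHS2 intentionally has no phenotype match in this sample.
phenotype = pd.DataFrame({
    'GENE': ['KCNJ1', 'KCNJ1', 'KCNJ1'],
    'HP-ID': ['HP:0002013', 'HP:0002007', 'HP:0001561'],
})


def add_hp_ids(variants, phenotype):
    """Hypothetical helper: attach pipe-joined HP-IDs to each variant row."""
    # Collapse phenotype to one row per gene, joining its HP-IDs with '|'.
    hp = phenotype.groupby('GENE')['HP-ID'].agg('|'.join).reset_index()
    # Left merge (an assumption): keep every variant, matched or not.
    return variants.merge(hp, on='GENE', how='left')


result = add_hp_ids(variants, phenotype)
print(result)
```

If unmatched variants should be dropped instead, leaving `how` at its default (`'inner'`) gives the behaviour of the one-liner above.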