Pandas error when merging two files on a different number of columns (2 and 1) in two dataframes

I have two files with the following structure:

df1

intA,intB
4933401J01Rik,Gm37180
Gm37686,Gm37363

df2

chr,gene_type,gene_symbol
chr1,TEC,4933401J01Rik
chr2,TEC,Gm37180
chr3,TEC,Gm37363
chr4,TEC,Gm37686

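I am trying to merge these two files. Basically, I need to pull the information for columns intA and intB of df1 from df2. In the final output, for each column of df1 there should be two extra columns reporting the chr and gene_type from df2. The final output should look like this:

Result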

intA,intB,chr,chr,gene_type,gene_type
4933401J01Rik,Gm37180,chr1,chr2,TEC,TEC
Gm37686,Gm37363,chr4,chr3,TEC,TEC

I ran this code, but it gives the error: Can only merge Series or DataFrame objects, a <class 'str'> was passed

df1 = pd.read_csv(df1)
df2 = pd.read_csv(df2)

   
result = pd.merge(df1, df2, how='left', left_on=['intA','intB'], right_on = ['gene_symbol'])

print(result)

Any help is appreciated, thank you.

There may be a more pandas-idiomatic way to do this, but this will do what you want:

import pandas as pd

df1 = pd.read_csv('a')   # file with columns intA,intB
df2 = pd.read_csv('b')   # file with columns chr,gene_type,gene_symbol

rows = []

for index, row in df1.iterrows():
    # look up the df2 entry for each of the two symbols in this row
    aMatch = df2.loc[df2['gene_symbol'] == row['intA']]
    bMatch = df2.loc[df2['gene_symbol'] == row['intB']]

    if aMatch.empty or bMatch.empty:
        # a symbol is missing from df2; skip the row instead of failing on values[0]
        print("malformed data")
        continue

    rows.append({'intA': row['intA'],
                 'intB': row['intB'],
                 'chrA': aMatch['chr'].values[0],
                 'chrB': bMatch['chr'].values[0],
                 'gene_typeA': aMatch['gene_type'].values[0],
                 'gene_typeB': bMatch['gene_type'].values[0]})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
df3 = pd.DataFrame(rows, columns=['intA', 'intB', 'chrA', 'chrB', 'gene_typeA', 'gene_typeB'])

Result:

            intA     intB  chrA  chrB gene_typeA gene_typeB
0  4933401J01Rik  Gm37180  chr1  chr2        TEC        TEC
1        Gm37686  Gm37363  chr4  chr3        TEC        TEC

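You can do it in an idiomatic, pandas-style way as follows:

Because you want to merge the contents of 2 columns in df1 (intA and intB) with another dataframe, df2, while matching on only one column (gene_symbol), you cannot merge them directly. The number of columns to match on differs between the two sides, which raises the error ValueError: len(right_on) must equal len(left_on)

Instead, you first have to turn the 2 columns intA and intB into a single column, with their contents placed on separate rows, before merging.

1. Combine intA and intB of df1 into one column, with the contents on separate rows: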
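The mismatch described above can be reproduced with a minimal sketch. The sample data is copied from the question, and io.StringIO is used here only as a stand-in for the real files, which is an assumption of this example.

```python
import io
import pandas as pd

# Sample data from the question; io.StringIO stands in for the real files.
df1 = pd.read_csv(io.StringIO(
    "intA,intB\n"
    "4933401J01Rik,Gm37180\n"
    "Gm37686,Gm37363\n"
))
df2 = pd.read_csv(io.StringIO(
    "chr,gene_type,gene_symbol\n"
    "chr1,TEC,4933401J01Rik\n"
    "chr2,TEC,Gm37180\n"
    "chr3,TEC,Gm37363\n"
    "chr4,TEC,Gm37686\n"
))

error_message = None
try:
    # Two key columns on the left, one on the right: pandas rejects this.
    pd.merge(df1, df2, how='left',
             left_on=['intA', 'intB'], right_on=['gene_symbol'])
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```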

df1a = df1.copy()
df1a.columns = df1a.columns.str.split(r'(int)', expand=True)   # split column labels
df1a = df1a.droplevel(level=0, axis=1)
df1a = df1a.stack().rename_axis(index=['index', 'int_type']).reset_index()
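If the label-splitting chain above is hard to follow, the same long layout (one row per symbol, with an index column and an int_type of A or B) can also be sketched with melt. This is an alternative way to reach the same shape, not the method used in this answer, and the inline sample data via io.StringIO is an assumption made so the sketch runs on its own.

```python
import io
import pandas as pd

# Sample data from the question; io.StringIO stands in for the real file.
df1 = pd.read_csv(io.StringIO(
    "intA,intB\n"
    "4933401J01Rik,Gm37180\n"
    "Gm37686,Gm37363\n"
))

# Keep the original row number in a column named 'index', then unpivot
# intA/intB into one 'int' column with the source label in 'int_type'.
df1a = df1.reset_index().melt(id_vars='index',
                              var_name='int_type',
                              value_name='int')

# Reduce the labels 'intA'/'intB' to 'A'/'B' so they can be re-attached later.
df1a['int_type'] = df1a['int_type'].str.replace('int', '', regex=False)

# Order the rows by original row, then by A/B.
df1a = df1a.sort_values(['index', 'int_type'], ignore_index=True)
print(df1a)
```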

2. Merge the new column int (the combined intA and intB) from df1 with gene_symbol from df2:

Now we can merge on the same number of columns from the 2 dataframes:

df_merge = pd.merge(df1a, df2, how='left', left_on='int', right_on='gene_symbol')

# remove column 'gene_symbol' which has same duplicated info as 'int'
df_merge2 = df_merge.drop('gene_symbol', axis=1)    

3. Pivot intA and intB back into 2 separate columns:

df_out = df_merge2.pivot(index='index', columns='int_type')

df_out.columns = df_out.columns.map(''.join)       # combine column labels 

Result:

print(df_out)

                intA     intB  chrA  chrB gene_typeA gene_typeB
index                                                          
0      4933401J01Rik  Gm37180  chr1  chr2        TEC        TEC
1            Gm37686  Gm37363  chr4  chr3        TEC        TEC
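As a cross-check on the result above, here is a shorter sketch that treats df2 as a lookup table and joins it onto df1 twice, once per column. This is a different technique from the stack/merge/pivot steps in this answer. It assumes each gene_symbol appears only once in df2, and the inline sample data via io.StringIO is likewise an assumption of the example.

```python
import io
import pandas as pd

# Sample data from the question; io.StringIO stands in for the real files.
df1 = pd.read_csv(io.StringIO(
    "intA,intB\n"
    "4933401J01Rik,Gm37180\n"
    "Gm37686,Gm37363\n"
))
df2 = pd.read_csv(io.StringIO(
    "chr,gene_type,gene_symbol\n"
    "chr1,TEC,4933401J01Rik\n"
    "chr2,TEC,Gm37180\n"
    "chr3,TEC,Gm37363\n"
    "chr4,TEC,Gm37686\n"
))

# Index df2 by gene_symbol so it can be joined against a df1 column.
# Assumption: gene_symbol values are unique in df2.
lookup = df2.set_index('gene_symbol')

# Join once for intA and once for intB, suffixing the looked-up columns
# with A or B so the two sets of columns do not collide.
result = (
    df1
    .join(lookup.add_suffix('A'), on='intA')
    .join(lookup.add_suffix('B'), on='intB')
)

# Reorder the columns to match the layout shown in the result above.
result = result[['intA', 'intB', 'chrA', 'chrB', 'gene_typeA', 'gene_typeB']]
print(result)
```

A symbol in df1 that has no entry in df2 would come through as NaN in the looked-up columns, since join defaults to a left join.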