Python Pandas fill missing column with left join

I have the following two DataFrames.

df_1

AA        BB    CC    DD
"Apple"   XYZ1  XYZ2  NaN
"Apple"   PQR1  PQR2  NaN
"Apple"   XYZ4  PRR9  NaN
"Banana"  XYZ1  NaN   416
"Banana"  XYZ1  NaN   416
"Apple"   XYZ4  PRR9  NaN

df_lookup

AA XX YY ZZ
"Apple" XYZ1 XYZ2 429
"Apple" XYZ4 PRR9 97
"Apple" PQR1 PQR2 108
"Banana" XYZ1 PQR1 416

Expected result:

My objective is to fill the empty values in df_1. In other words:

if AA == "Apple" then 
 df_1.DD = SELECT df_lookup.ZZ 
 FROM df_lookup 
 LEFT JOIN df_1 
 ON df_1.BB = df_lookup.XX AND df_1.CC = df_lookup.YY

Conversely...

if AA == "Banana" then 
 df_1.CC = SELECT df_lookup.YY 
 FROM df_lookup 
 LEFT JOIN df_1 
 ON df_1.BB = df_lookup.XX AND df_1.DD = df_lookup.ZZ

df_1 (filled/modified)

AA BB CC DD
"Apple" XYZ1 XYZ2 429
"Apple" PQR1 PQR2 108
"Apple" XYZ4 PRR9 97
"Banana" XYZ1 PQR1 416
"Banana" XYZ1 PQR1 416
"Apple" XYZ4 PRR9 97

So far I have tried the following:

apple_merged = pd.merge(df_1, df_lookup, left_on = ["BB", "CC"], right_on = ["XX", "YY"])
df_1[(df_1["AA"] == "Apple")]["DD"] = apple_merged[(apple_merged.AA == "Apple")]["ZZ"].values

In my actual code I get the following error:

ValueError: Length of values (501) does not match length of index (602)

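This seems to indicate that the shapes on the two sides of the assignment differ, 501 vs. 602. But if I really did a left join, shouldn't the number of rows be the same in this case?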

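When you merge two DataFrames in pandas, you need to pass how='left' to get a left join; otherwise pandas defaults to an inner join. That is what causes the error: the inner-joined apple_merged DataFrame has 501 values, while df_1 has 602.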

Link: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
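The difference in row counts can be seen on a toy example. This is only a minimal sketch: the frames `left` and `right` and their column names are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical frames: key "c" exists only on the left side.
left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b"], "extra": [10, 20]})

# No how= argument: pandas performs an inner join and drops the unmatched row.
inner = left.merge(right, on="key")

# how="left": every row of `left` is kept; unmatched rows get NaN in "extra".
left_joined = left.merge(right, on="key", how="left")

print(len(left), len(inner), len(left_joined))  # 3 2 3
```

Assigning `inner["extra"].values` into a 3-row selection of `left` would fail with the same kind of length mismatch as in the question.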

First use the merge() method with how='left', so that every row of df_1 is kept (the default would be an inner join):-

df=df_1.merge(df_lookup,how='left',left_on=['AA','BB'],right_on=['AA','XX'])

Now fill the missing values of 'DD' and 'CC' from 'ZZ' and 'YY' respectively with the fillna() method:-

df['DD']=df['DD'].fillna(df['ZZ']).astype(int)
df['CC']=df['CC'].fillna(df['YY'])

Now all your values are filled, so remove the extra columns with the drop() method, passing a list to the columns parameter:-

df=df.drop(columns=['XX','YY','ZZ'])

Now if you print df you will get the expected output:-

    AA       BB     CC      DD
0   Apple   XYZ1    XYZ2    429
1   Apple   PQR1    PQR2    108
2   Apple   XYZ4    PRR9    97
3   Banana  XYZ1    PQR1    416
4   Banana  XYZ1    PQR1    416
5   Apple   XYZ4    PRR9    97
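Put together with the sample data from the question, these steps can be sketched as one runnable script. The way the two DataFrames are built is an assumption about how the question's tables would be typed in (empty cells as NaN, DD and ZZ numeric).

```python
import numpy as np
import pandas as pd

# Assumed construction of the question's sample tables.
df_1 = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana", "Banana", "Apple"],
    "BB": ["XYZ1", "PQR1", "XYZ4", "XYZ1", "XYZ1", "XYZ4"],
    "CC": ["XYZ2", "PQR2", "PRR9", np.nan, np.nan, "PRR9"],
    "DD": [np.nan, np.nan, np.nan, 416, 416, np.nan],
})
df_lookup = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana"],
    "XX": ["XYZ1", "XYZ4", "PQR1", "XYZ1"],
    "YY": ["XYZ2", "PRR9", "PQR2", "PQR1"],
    "ZZ": [429, 97, 108, 416],
})

# Left join keeps all six rows of df_1, in their original order.
df = df_1.merge(df_lookup, how="left",
                left_on=["AA", "BB"], right_on=["AA", "XX"])

# Fill each missing value from the matching lookup column.
df["DD"] = df["DD"].fillna(df["ZZ"]).astype(int)
df["CC"] = df["CC"].fillna(df["YY"])

# Drop the helper columns that came from df_lookup.
df = df.drop(columns=["XX", "YY", "ZZ"])
print(df)
```

The cast with astype(int) only works here because every row finds a match; an unmatched row would leave NaN in DD and the cast would raise.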

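Edit: if df_lookup has no AA column, merge on BB alone. Note that a BB value matching several rows of df_lookup then produces one output row per match:

df=df_1.merge(df_lookup,how='left',left_on=['BB'],right_on=['XX'])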
df['DD']=df['DD'].fillna(df['ZZ']).astype(int)
df['CC']=df['CC'].fillna(df['YY'])
df=df.drop(columns=['XX','YY','ZZ'])

If you want to remove duplicates, use:-

df=df.drop_duplicates()
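One caution when merging on BB alone: in the question's sample data XYZ1 appears twice in the lookup's XX column, so each matching df_1 row is duplicated with different YY/ZZ values, and drop_duplicates() does not remove those rows because they are not identical. The sketch below shows the row growth and uses merge's `validate` argument to detect it; the name `lookup_no_aa` is hypothetical, and the data layout is an assumption based on the question's tables.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana", "Banana", "Apple"],
    "BB": ["XYZ1", "PQR1", "XYZ4", "XYZ1", "XYZ1", "XYZ4"],
    "CC": ["XYZ2", "PQR2", "PRR9", np.nan, np.nan, "PRR9"],
    "DD": [np.nan, np.nan, np.nan, 416, 416, np.nan],
})
# Hypothetical lookup without the AA column; XX == "XYZ1" occurs twice.
lookup_no_aa = pd.DataFrame({
    "XX": ["XYZ1", "XYZ4", "PQR1", "XYZ1"],
    "YY": ["XYZ2", "PRR9", "PQR2", "PQR1"],
    "ZZ": [429, 97, 108, 416],
})

# Three df_1 rows have BB == "XYZ1"; each matches two lookup rows.
merged = df_1.merge(lookup_no_aa, how="left", left_on="BB", right_on="XX")
print(len(df_1), len(merged))  # 6 9

# validate="many_to_one" asks pandas to check that the right-hand keys are unique.
try:
    df_1.merge(lookup_no_aa, how="left", left_on="BB", right_on="XX",
               validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)  # True
```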

Use:

d = {'XX':'BB','YY':'CC', 'ZZ':'DD'}

# rename the lookup columns so they match the names in df_1
df2 = df_lookup.rename(columns=d)
# two left joins: the first matches on CC (brings in DD_), the second on DD (brings in CC_)
df = (df_1.merge(df2, how='left', on=['AA','BB','CC'], suffixes=('','_'))
          .merge(df2, how='left', on=['AA','BB','DD'], suffixes=('','_')))

# fill the original columns from the added columns ending in _, then drop the helpers
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
print(df)
       AA    BB    CC     DD
0   Apple  XYZ1  XYZ2  429.0
1   Apple  PQR1  PQR2  108.0
2   Apple  XYZ4  PRR9   97.0
3  Banana  XYZ1  PQR1  416.0
4  Banana  XYZ1  PQR1  416.0
5   Apple  XYZ4  PRR9   97.0
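DD prints as 429.0 here because a column that held NaN is stored as float. If integers are wanted, one option is to cast after filling. This is a minimal sketch on stand-in Series rather than on the merged frame itself; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for a fully filled DD column.
dd = pd.Series([429.0, 108.0, 97.0, 416.0], name="DD")
dd_int = dd.astype(int)  # safe only when no NaN remains

# Stand-in for a DD column where some lookups found no match.
dd_partial = pd.Series([429.0, np.nan], name="DD")
# The nullable "Int64" dtype keeps missing entries as <NA> instead of raising.
dd_nullable = dd_partial.astype("Int64")

print(dd_int.tolist())
print(dd_nullable)
```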