Python Pandas fill missing column with left join
I have the following two dataframes.
df_1
AA | BB | CC | DD |
---|---|---|---|
"Apple" | XYZ1 | XYZ2 | |
"Apple" | PQR1 | PQR2 | |
"Apple" | XYZ4 | PRR9 | |
"Banana" | XYZ1 | | 416 |
"Banana" | XYZ1 | | 416 |
"Apple" | XYZ4 | PRR9 | |
df_lookup
AA | XX | YY | ZZ |
---|---|---|---|
"Apple" | XYZ1 | XYZ2 | 429 |
"Apple" | XYZ4 | PRR9 | 97 |
"Apple" | PQR1 | PQR2 | 108 |
"Banana" | XYZ1 | PQR1 | 416 |
Expected result:
My objective is to fill the empty values in df_1. In other words:
if AA == "Apple" then
df_1.DD = SELECT df_lookup.ZZ
FROM df_lookup
LEFT JOIN df_1
ON df_1.BB = df_lookup.XX AND df_1.CC = df_lookup.YY
Conversely...
if AA == "Banana" then
df_1.CC = SELECT df_lookup.YY
FROM df_lookup
LEFT JOIN df_1
ON df_1.BB = df_lookup.XX AND df_1.DD = df_lookup.ZZ
df_1 (filled/modified)
AA | BB | CC | DD |
---|---|---|---|
"Apple" | XYZ1 | XYZ2 | 429 |
"Apple" | PQR1 | PQR2 | 108 |
"Apple" | XYZ4 | PRR9 | 97 |
"Banana" | XYZ1 | PQR1 | 416 |
"Banana" | XYZ1 | PQR1 | 416 |
"Apple" | XYZ4 | PRR9 | 97 |
Here is what I have tried so far:
apple_merged = pd.merge(df_1, df_lookup, left_on=["BB", "CC"], right_on=["XX", "YY"])
df_1[(df_1["AA"] == "Apple")]["DD"] = apple_merged[(apple_merged.AA == "Apple")]["ZZ"].values
In my actual code I get the following error:
ValueError: Length of values (501) does not match length of index (602)
This seems to indicate that the two sides of the assignment have different shapes, 501 vs. 602. But if I were really doing a left join, shouldn't the number of rows be the same in this case?
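When you merge two dataframes in pandas, you need to pass the how= parameter explicitly; otherwise pandas defaults to an inner join. That is what causes the error: the inner-joined apple_merged dataframe has 501 values, while df_1 has 602.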
Link: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
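To make the row-count difference concrete, here is a minimal sketch on toy data. The frames `left` and `right` and their column names are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical toy frames: key "c" has no match in `right`.
left = pd.DataFrame({"key": ["a", "b", "c"], "val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b"], "extra": [10, 20]})

# Default how="inner": unmatched left rows are dropped.
inner = left.merge(right, on="key")

# how="left": every left row is kept; unmatched rows get NaN in `extra`.
left_join = left.merge(right, on="key", how="left")

print(len(left), len(inner), len(left_join))
```

With an inner join the result can be shorter than the left frame, which is the kind of length mismatch the ValueError reports.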
First use the merge() method. Here we pass how='left', since the default is an inner join:-
df=df_1.merge(df_lookup,how='left',left_on=['AA','BB'],right_on=['AA','XX'])
Now fill the missing values of 'DD' and 'CC' from 'ZZ' and 'YY' respectively, using the fillna() method:-
df['DD']=df['DD'].fillna(df['ZZ']).astype(int)
df['CC']=df['CC'].fillna(df['YY'])
All the values are now filled, so we remove the extra columns with the drop() method, passing a list to its columns parameter:-
df=df.drop(columns=['XX','YY','ZZ'])
Now if you print df, you will get the expected output:-
AA BB CC DD
0 Apple XYZ1 XYZ2 429
1 Apple PQR1 PQR2 108
2 Apple XYZ4 PRR9 97
3 Banana XYZ1 PQR1 416
4 Banana XYZ1 PQR1 416
5 Apple XYZ4 PRR9 97
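For anyone who wants to run the steps above end to end, here is a self-contained sketch. The construction of `df_1` and `df_lookup` is an assumption based on the tables in the question; in particular it assumes the "Banana" rows have 416 in DD and an empty CC, which is what the expected result implies.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumed layout, see note above).
df_1 = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana", "Banana", "Apple"],
    "BB": ["XYZ1", "PQR1", "XYZ4", "XYZ1", "XYZ1", "XYZ4"],
    "CC": ["XYZ2", "PQR2", "PRR9", np.nan, np.nan, "PRR9"],
    "DD": [np.nan, np.nan, np.nan, 416, 416, np.nan],
})
df_lookup = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana"],
    "XX": ["XYZ1", "XYZ4", "PQR1", "XYZ1"],
    "YY": ["XYZ2", "PRR9", "PQR2", "PQR1"],
    "ZZ": [429, 97, 108, 416],
})

# Left join keeps all six rows of df_1.
df = df_1.merge(df_lookup, how="left",
                left_on=["AA", "BB"], right_on=["AA", "XX"])

# Fill the gaps from the lookup columns, then drop the helpers.
df["DD"] = df["DD"].fillna(df["ZZ"]).astype(int)
df["CC"] = df["CC"].fillna(df["YY"])
df = df.drop(columns=["XX", "YY", "ZZ"])
print(df)
```

This only works cleanly because each (AA, XX) pair appears once in the sample `df_lookup`; with repeated pairs the merge would return more rows than `df_1` has.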
Edit: if df_lookup does not have the AA column:
df=df_1.merge(df_lookup,left_on=['BB'],right_on=['XX'])
df['DD']=df['DD'].fillna(df['ZZ']).astype(int)
df['CC']=df['CC'].fillna(df['YY'])
df=df.drop(columns=['XX','YY','ZZ'])
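Merging on BB alone can return repeated rows when XX is not unique in df_lookup. If you want to remove exact duplicates, use:-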
df=df.drop_duplicates()
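One way to catch a non-unique lookup key early is the `validate` argument of `merge`. The sketch below uses small hypothetical frames (`left`, `lookup`) rather than the question's data, and is meant as an illustration, not as part of the original answer.

```python
import pandas as pd

# Hypothetical frames: "XYZ1" appears twice in the lookup key column.
left = pd.DataFrame({"BB": ["XYZ1", "XYZ4"]})
lookup = pd.DataFrame({"XX": ["XYZ1", "XYZ1", "XYZ4"], "ZZ": [429, 416, 97]})

# Without validation the left join silently grows from 2 rows to 3.
merged = left.merge(lookup, how="left", left_on="BB", right_on="XX")
print(len(left), len(merged))

# validate="many_to_one" requires the right-hand keys to be unique
# and raises MergeError otherwise.
try:
    left.merge(lookup, how="left", left_on="BB", right_on="XX",
               validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

Note that drop_duplicates() only removes rows that are identical in every column, so it would not collapse rows that were filled with different lookup values.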
Use:
d = {'XX':'BB','YY':'CC', 'ZZ':'DD'}
# rename the lookup columns so they match df_1
df2 = df_lookup.rename(columns=d)
# left join twice: once on (AA, BB, CC) to get DD_, once on (AA, BB, DD) to get CC_
df = (df_1.merge(df2, how='left', on=['AA','BB','CC'], suffixes=('','_'))
          .merge(df2, how='left', on=['AA','BB','DD'], suffixes=('','_')))
# fill the original columns from the added columns ending in _, then drop the latter
cols = df.columns[df.columns.str.endswith('_')]
df = df.combine_first(df[cols].rename(columns=lambda x: x.strip('_'))).drop(cols, axis=1)
print (df)
AA BB CC DD
0 Apple XYZ1 XYZ2 429.0
1 Apple PQR1 PQR2 108.0
2 Apple XYZ4 PRR9 97.0
3 Banana XYZ1 PQR1 416.0
4 Banana XYZ1 PQR1 416.0
5 Apple XYZ4 PRR9 97.0
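The attempt in the question also assigns through chained indexing (`df_1[mask]["DD"] = ...`), which writes to a temporary copy rather than to `df_1`. A possible variant that stays close to the asker's per-fruit logic is sketched below; it assumes the sample data as reconstructed from the question's tables and assumes the lookup keys are unique within each AA group, so the left join returns exactly one row per masked row, in the same order.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumed layout).
df_1 = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana", "Banana", "Apple"],
    "BB": ["XYZ1", "PQR1", "XYZ4", "XYZ1", "XYZ1", "XYZ4"],
    "CC": ["XYZ2", "PQR2", "PRR9", np.nan, np.nan, "PRR9"],
    "DD": [np.nan, np.nan, np.nan, 416, 416, np.nan],
})
df_lookup = pd.DataFrame({
    "AA": ["Apple", "Apple", "Apple", "Banana"],
    "XX": ["XYZ1", "XYZ4", "PQR1", "XYZ1"],
    "YY": ["XYZ2", "PRR9", "PQR2", "PQR1"],
    "ZZ": [429, 97, 108, 416],
})

apple = df_1["AA"] == "Apple"
banana = df_1["AA"] == "Banana"

# Apple rows: look up ZZ by (BB, CC). validate guards the uniqueness assumption.
apple_zz = df_1.loc[apple, ["BB", "CC"]].merge(
    df_lookup[df_lookup["AA"] == "Apple"], how="left",
    left_on=["BB", "CC"], right_on=["XX", "YY"], validate="many_to_one",
)["ZZ"]
# .loc writes into df_1 itself; to_numpy() avoids index alignment.
df_1.loc[apple, "DD"] = apple_zz.to_numpy()

# Banana rows: look up YY by (BB, DD).
banana_yy = df_1.loc[banana, ["BB", "DD"]].merge(
    df_lookup[df_lookup["AA"] == "Banana"], how="left",
    left_on=["BB", "DD"], right_on=["XX", "ZZ"], validate="many_to_one",
)["YY"]
df_1.loc[banana, "CC"] = banana_yy.to_numpy()

print(df_1)
```

Because the join is a left join on unique lookup keys, the number of values on the right-hand side matches the number of masked rows, so the length mismatch from the question does not arise.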