How to merge my dataframe to compensate for missing data without adding new columns?
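I am trying to manipulate Excel sheet data to automate a process in Excel (I am not a developer). I have two dataframes:
The first looks like the following (the only difference is that the real one has more columns):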
Date Val1 Val2
0 2020-09-29 13:22:57 5.34 3.2
1 2020-09-29 13:23:12 4.5 NaN
2 2020-09-29 13:23:44 NaN 56.4
3 2020-09-29 13:24:01 24 0.3
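Note that the index above is ordered and every row has a Date, but the other columns are not necessarily filled.
The second dataframe has the same number of rows or more. It contains no dates that are absent from the first and no duplicate dates, but the extra rows are empty (NaT for Date and NaN for all other columns). Because of other processing, the index of df2 is also not in order: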
Date Val1 Val2
0 2020-09-29 13:22:57 NaN NaN
5 NaT NaN NaN
1 2020-09-29 13:23:12 4.5 NaN
4 NaT NaN NaN
6 NaT NaN NaN
2 2020-09-29 13:23:44 NaN NaN
3 2020-09-29 13:24:01 24 0.3
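What I basically need is to check for matching dates: if a date in df2 matches a date in df1, fill the whole row for that date in df2 with the exact values from df1, without changing the position of the empty rows in df2 and without adding columns.
Expected output: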
Date Val1 Val2
0 2020-09-29 13:22:57 5.34 3.2
5 NaT NaN NaN
1 2020-09-29 13:23:12 4.5 NaN
4 NaT NaN NaN
6 NaT NaN NaN
2 2020-09-29 13:23:44 NaN 56.4
3 2020-09-29 13:24:01 24 0.3
I have tried several approaches, including:
from functools import reduce  # reduce is not a builtin in Python 3

data_frames = [df, df_2]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Date'],
                                                how='outer'), data_frames)
print(df_merged)
And also:
df_f = pd.merge(df, df_2, on='Date', how='outer').fillna(method='ffill')
I also tried changing how to inner, left, right, and so on, but did not get the result I wanted; I only got combined columns.
Edit:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', '2020-09-29 13:23:12', '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
'Val1': [5.34, 4.5, np.nan, 24],
'Val2': [3.2, np.nan, 56.4, 0.3]})
df2 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', np.nan, '2020-09-29 13:23:12', np.nan, np.nan, '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
'Val1': [5.34, np.nan, 4.5, np.nan, np.nan, np.nan, 24],
'Val2': [3.2, np.nan, np.nan, np.nan, np.nan, 56.4, 0.3]},
index=[0,5,1,4,6,2,3])
f_f1 = df1.merge(df2["Date"], on="Date", how="right").set_index(df2.index)
print(f_f1)
IIUC, try:
#convert to datetime if needed
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Date"] = pd.to_datetime(df2["Date"])
f_f1 = df1.merge(df2["Date"], on="Date", how="right").set_index(df2.index)
>>> f_f1
Date Val1 Val2
0 2020-09-29 13:22:57 5.34 3.2
5 NaN NaN NaN
1 2020-09-29 13:23:12 4.50 NaN
4 NaN NaN NaN
6 NaN NaN NaN
2 2020-09-29 13:23:44 NaN 56.4
3 2020-09-29 13:24:01 24.00 0.3
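The attempts in the question returned combined columns because both frames carry the same value columns, so pd.merge keeps both copies and tells them apart with its default suffixes _x and _y. Passing only df2["Date"] leaves nothing to duplicate. A minimal sketch on a trimmed two-row example (the names left and right are made up for this illustration and are not from the question):

```python
import numpy as np
import pandas as pd

# Two tiny frames that share the key column and one value column.
left = pd.DataFrame({"Date": ["2020-09-29 13:22:57", "2020-09-29 13:23:12"],
                     "Val1": [5.34, 4.5]})
right = pd.DataFrame({"Date": ["2020-09-29 13:22:57", "2020-09-29 13:23:12"],
                      "Val1": [np.nan, 4.5]})

# Merging the full frames keeps both Val1 columns, renamed with suffixes.
both = pd.merge(left, right, on="Date", how="outer")
print(list(both.columns))      # ['Date', 'Val1_x', 'Val1_y']

# Merging against the key column only keeps a single Val1 column.
key_only = left.merge(right["Date"], on="Date", how="right")
print(list(key_only.columns))  # ['Date', 'Val1']
```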
Input:
df1 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', '2020-09-29 13:23:12', '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
'Val1': [5.34, 4.5, np.nan, 24],
'Val2': [3.2, np.nan, 56.4, 0.3]})
df2 = pd.DataFrame({'Date': ['2020-09-29 13:22:57', np.nan, '2020-09-29 13:23:12', np.nan, np.nan, '2020-09-29 13:23:44', '2020-09-29 13:24:01'],
'Val1': [5.34, np.nan, 4.5, np.nan, np.nan, np.nan, 24],
'Val2': [3.2, np.nan, np.nan, np.nan, np.nan, 56.4, 0.3]},
index=[0,5,1,4,6,2,3])
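As a possible variation on the same idea, the merge can be written from df2's side with how="left", which keeps the rows in df2's order. This is only a sketch under the question's stated assumption that the dates in df1 are unique; otherwise the merge would return more rows than df2 has and the index assignment would fail. The name result is made up for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-09-29 13:22:57", "2020-09-29 13:23:12",
                            "2020-09-29 13:23:44", "2020-09-29 13:24:01"]),
    "Val1": [5.34, 4.5, np.nan, 24],
    "Val2": [3.2, np.nan, 56.4, 0.3],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-09-29 13:22:57", None, "2020-09-29 13:23:12",
                            None, None, "2020-09-29 13:23:44",
                            "2020-09-29 13:24:01"]),
    "Val1": [5.34, np.nan, 4.5, np.nan, np.nan, np.nan, 24],
    "Val2": [3.2, np.nan, np.nan, np.nan, np.nan, 56.4, 0.3],
}, index=[0, 5, 1, 4, 6, 2, 3])

# Keep only df2's Date column, then pull the value columns from df1.
# A left merge preserves the order of the left frame's rows.
result = df2[["Date"]].merge(df1, on="Date", how="left")

# merge returns a fresh 0..n-1 index, so restore df2's original labels.
result.index = df2.index
print(result)
```

Rows whose Date is NaT find no partner in df1 and therefore stay empty, in the same positions they had in df2.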