How to merge two dataframes based on a date condition from two different date columns in each dataframe?
I have two DataFrames of the following form:
DataFrame (df1):
P_CLIENT_ID | P_DATE_ENCOUNTER |
---|---|
25835 | 2016-12-21 |
25835 | 2017-02-21 |
25835 | 2017-04-25 |
25835 | 2017-06-21 |
25835 | 2017-09-04 |
25835 | 2018-01-08 |
25835 | 2018-04-03 |
DataFrame (df2):
R_CLIENT_ID | R_DATE_TESTED | R_RESULT |
---|---|---|
25835 | 2017-03-07 | 20.0 |
25835 | 2017-08-03 | 20.0 |
25835 | 2018-03-23 | 20.0 |
25835 | 2019-06-28 | 20.0 |
25835 | 2019-08-19 | 42.0 |
25835 | 2020-04-20 | 40.0 |
25835 | 2021-06-03 | 20.0 |
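I want to merge df2 into df1 (the main table), joining on P_CLIENT_ID and R_CLIENT_ID, and append the most recent R_DATE_TESTED and R_RESULT.
First condition: if R_DATE_TESTED > P_DATE_ENCOUNTER, null out the R_DATE_TESTED and R_RESULT fields.
Second condition: if R_DATE_TESTED < P_DATE_ENCOUNTER, apply the most recent R_DATE_TESTED and R_RESULT fields to the dataframe.
The logical result should look like this: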
P_CLIENT_ID | R_CLIENT_ID | P_DATE_ENCOUNTER | R_DATE_TESTED | R_RESULT |
---|---|---|---|---|
25835 | 25835.0 | 2016-12-21 | NaN | NaN |
25835 | 25835.0 | 2017-02-21 | NaN | NaN |
25835 | 25835.0 | 2017-04-25 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-06-21 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-09-04 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-01-08 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-04-03 | 2018-03-23 | 20.0 |
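One way to read the two conditions together: for each visit, take the latest test for the same client dated strictly before the visit, or nothing if no such test exists. A naive row-by-row sketch of that rule follows. The function name `latest_prior_test`, the integer client IDs, and the conversion to datetimes are assumptions made for illustration. It is meant as a reference for checking small samples, not for data of the size mentioned in the question.

```python
import pandas as pd

visits = pd.DataFrame({
    "P_CLIENT_ID": [25835] * 7,
    "P_DATE_ENCOUNTER": pd.to_datetime([
        "2016-12-21", "2017-02-21", "2017-04-25", "2017-06-21",
        "2017-09-04", "2018-01-08", "2018-04-03",
    ]),
})
tests = pd.DataFrame({
    "R_CLIENT_ID": [25835] * 7,
    "R_DATE_TESTED": pd.to_datetime([
        "2017-03-07", "2017-08-03", "2018-03-23", "2019-06-28",
        "2019-08-19", "2020-04-20", "2021-06-03",
    ]),
    "R_RESULT": [20.0, 20.0, 20.0, 20.0, 42.0, 40.0, 20.0],
})


def latest_prior_test(visits: pd.DataFrame, tests: pd.DataFrame) -> pd.DataFrame:
    """Attach to each visit the latest same-client test dated strictly before it."""
    rows = []
    for visit in visits.itertuples(index=False):
        # Candidate tests: same client, tested strictly before the visit date.
        prior = tests[
            (tests["R_CLIENT_ID"] == visit.P_CLIENT_ID)
            & (tests["R_DATE_TESTED"] < visit.P_DATE_ENCOUNTER)
        ]
        if prior.empty:
            # No earlier test: leave the lab fields empty.
            rows.append({"R_DATE_TESTED": pd.NaT, "R_RESULT": float("nan")})
        else:
            # The most recent of the earlier tests.
            latest = prior.sort_values("R_DATE_TESTED").iloc[-1]
            rows.append({
                "R_DATE_TESTED": latest["R_DATE_TESTED"],
                "R_RESULT": latest["R_RESULT"],
            })
    return pd.concat([visits.reset_index(drop=True), pd.DataFrame(rows)], axis=1)


reference = latest_prior_test(visits, tests)
print(reference)
```

Because this loops over every visit and filters the tests each time, its cost grows with the product of the two row counts.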
Note: the actual datasets are quite large: df1 has ~700,000 rows and df2 has ~125,000 rows.
Code attempt
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'P_CLIENT_ID': ['25835','25835','25835','25835','25835','25835','25835'],
                    'P_DATE_ENCOUNTER': ['2016-12-21','2017-02-21','2017-04-25','2017-06-21','2017-09-04','2018-01-08','2018-04-03']})
df2 = pd.DataFrame({'R_CLIENT_ID': ['25835','25835','25835','25835','25835','25835','25835'],
                    'R_DATE_TESTED': ['2017-03-07','2017-08-03','2018-03-23','2019-06-28','2019-08-19','2020-04-20','2021-06-03'],
                    'R_RESULT': [20,20,20,20,42,40,20]})

# Left-join every lab row onto every visit of the same client
df_merged = pd.merge(df1, df2, left_on=['P_CLIENT_ID'], right_on=['R_CLIENT_ID'], how='left')
# Keep one row per visit: the last lab row joined to it
df_merged = df_merged.drop_duplicates(subset=['P_CLIENT_ID', 'P_DATE_ENCOUNTER'], keep='last')

# Flag rows whose lab date is on or after the visit date
df_merged['FLAG_LAB_AFTER_VISIT'] = 0
df_merged.loc[df_merged.R_DATE_TESTED >= df_merged.P_DATE_ENCOUNTER, 'FLAG_LAB_AFTER_VISIT'] = 1
print(df_merged['FLAG_LAB_AFTER_VISIT'].sum(), 'future labs set to null')

# Now, for the flagged rows, set all lab fields to null
df_merged.loc[df_merged['FLAG_LAB_AFTER_VISIT'] == 1, df2.columns] = np.nan
Use pd.merge_asof. Both date columns must be datetimes, and each frame must be sorted by its date column:
>>> pd.merge_asof(df1,
...               df2,
...               left_on="P_DATE_ENCOUNTER",
...               right_on="R_DATE_TESTED",
...               left_by="P_CLIENT_ID",
...               right_by="R_CLIENT_ID")
   P_CLIENT_ID P_DATE_ENCOUNTER  R_CLIENT_ID R_DATE_TESTED  R_RESULT
0        25835       2016-12-21          NaN           NaT       NaN
1        25835       2017-02-21          NaN           NaT       NaN
2        25835       2017-04-25      25835.0    2017-03-07      20.0
3        25835       2017-06-21      25835.0    2017-03-07      20.0
4        25835       2017-09-04      25835.0    2017-08-03      20.0
5        25835       2018-01-08      25835.0    2017-08-03      20.0
6        25835       2018-04-03      25835.0    2018-03-23      20.0
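For completeness, here is a self-contained sketch of the preparation the call above relies on. It assumes integer client IDs (as the printed output suggests) and converts the date strings with pd.to_datetime before sorting. The `allow_exact_matches=False` argument is an added assumption: it excludes tests dated on the visit day itself, matching the strict `<` in the question, whereas the default would include them.

```python
import pandas as pd

df1 = pd.DataFrame({
    "P_CLIENT_ID": [25835] * 7,
    "P_DATE_ENCOUNTER": ["2016-12-21", "2017-02-21", "2017-04-25", "2017-06-21",
                         "2017-09-04", "2018-01-08", "2018-04-03"],
})
df2 = pd.DataFrame({
    "R_CLIENT_ID": [25835] * 7,
    "R_DATE_TESTED": ["2017-03-07", "2017-08-03", "2018-03-23", "2019-06-28",
                      "2019-08-19", "2020-04-20", "2021-06-03"],
    "R_RESULT": [20.0, 20.0, 20.0, 20.0, 42.0, 40.0, 20.0],
})

# merge_asof cannot match on string dates: convert both keys to datetimes.
df1["P_DATE_ENCOUNTER"] = pd.to_datetime(df1["P_DATE_ENCOUNTER"])
df2["R_DATE_TESTED"] = pd.to_datetime(df2["R_DATE_TESTED"])

# Both frames must be sorted by their "on" key.
df1 = df1.sort_values("P_DATE_ENCOUNTER")
df2 = df2.sort_values("R_DATE_TESTED")

result = pd.merge_asof(
    df1,
    df2,
    left_on="P_DATE_ENCOUNTER",
    right_on="R_DATE_TESTED",
    left_by="P_CLIENT_ID",       # match only within the same client
    right_by="R_CLIENT_ID",
    direction="backward",        # take the latest test at or before the visit
    allow_exact_matches=False,   # assumption: same-day tests do not count
)
print(result)
```

On the sample data the first two visits get no test, and the remaining five pick up the 2017-03-07, 2017-03-07, 2017-08-03, 2017-08-03 and 2018-03-23 tests, as in the output above. Every row of df1 is kept, so the result has one row per visit.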