How to get a unique record from a cross-joined table based on the dates in two different columns?
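I need to build some fairly complex logic. I have client clinic encounter data that contains historical test results (`R_DATE_TESTED`, `R_RESULT`) mapped to every `P_DATE_ENCOUNTER` of each client (`P_CLIENT_ID`).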
RECORD_ID | P_CLIENT_ID | R_CLIENT_ID | P_DATE_ENCOUNTER | R_DATE_TESTED | R_RESULT |
---|---|---|---|---|---|
302950 | 25835 | 25835.0 | 2016-12-21 | 2017-03-07 | 20.0 |
302951 | 25835 | 25835.0 | 2016-12-21 | 2017-08-03 | 20.0 |
302952 | 25835 | 25835.0 | 2016-12-21 | 2018-03-23 | 20.0 |
302953 | 25835 | 25835.0 | 2016-12-21 | 2019-06-28 | 20.0 |
302954 | 25835 | 25835.0 | 2016-12-21 | 2019-08-19 | 42.0 |
302955 | 25835 | 25835.0 | 2016-12-21 | 2020-04-20 | 40.0 |
302956 | 25835 | 25835.0 | 2016-12-21 | 2021-06-03 | 20.0 |
302957 | 25835 | 25835.0 | 2017-02-21 | 2017-03-07 | 20.0 |
302958 | 25835 | 25835.0 | 2017-02-21 | 2017-08-03 | 20.0 |
302959 | 25835 | 25835.0 | 2017-02-21 | 2018-03-23 | 20.0 |
302960 | 25835 | 25835.0 | 2017-02-21 | 2019-06-28 | 20.0 |
302961 | 25835 | 25835.0 | 2017-02-21 | 2019-08-19 | 42.0 |
302962 | 25835 | 25835.0 | 2017-02-21 | 2020-04-20 | 40.0 |
302963 | 25835 | 25835.0 | 2017-02-21 | 2021-06-03 | 20.0 |
302964 | 25835 | 25835.0 | 2017-04-25 | 2017-03-07 | 20.0 |
302965 | 25835 | 25835.0 | 2017-04-25 | 2017-08-03 | 20.0 |
302966 | 25835 | 25835.0 | 2017-04-25 | 2018-03-23 | 20.0 |
302967 | 25835 | 25835.0 | 2017-04-25 | 2019-06-28 | 20.0 |
302968 | 25835 | 25835.0 | 2017-04-25 | 2019-08-19 | 42.0 |
302969 | 25835 | 25835.0 | 2017-04-25 | 2020-04-20 | 40.0 |
302970 | 25835 | 25835.0 | 2017-04-25 | 2021-06-03 | 20.0 |
302971 | 25835 | 25835.0 | 2017-06-21 | 2017-03-07 | 20.0 |
302972 | 25835 | 25835.0 | 2017-06-21 | 2017-08-03 | 20.0 |
302973 | 25835 | 25835.0 | 2017-06-21 | 2018-03-23 | 20.0 |
302974 | 25835 | 25835.0 | 2017-06-21 | 2019-06-28 | 20.0 |
302975 | 25835 | 25835.0 | 2017-06-21 | 2019-08-19 | 42.0 |
302976 | 25835 | 25835.0 | 2017-06-21 | 2020-04-20 | 40.0 |
302977 | 25835 | 25835.0 | 2017-06-21 | 2021-06-03 | 20.0 |
302978 | 25835 | 25835.0 | 2017-09-04 | 2017-03-07 | 20.0 |
302979 | 25835 | 25835.0 | 2017-09-04 | 2017-08-03 | 20.0 |
302980 | 25835 | 25835.0 | 2017-09-04 | 2018-03-23 | 20.0 |
302981 | 25835 | 25835.0 | 2017-09-04 | 2019-06-28 | 20.0 |
302982 | 25835 | 25835.0 | 2017-09-04 | 2019-08-19 | 42.0 |
302983 | 25835 | 25835.0 | 2017-09-04 | 2020-04-20 | 40.0 |
302984 | 25835 | 25835.0 | 2017-09-04 | 2021-06-03 | 20.0 |
302985 | 25835 | 25835.0 | 2018-01-08 | 2017-03-07 | 20.0 |
302986 | 25835 | 25835.0 | 2018-01-08 | 2017-08-03 | 20.0 |
302987 | 25835 | 25835.0 | 2018-01-08 | 2018-03-23 | 20.0 |
302988 | 25835 | 25835.0 | 2018-01-08 | 2019-06-28 | 20.0 |
302989 | 25835 | 25835.0 | 2018-01-08 | 2019-08-19 | 42.0 |
302990 | 25835 | 25835.0 | 2018-01-08 | 2020-04-20 | 40.0 |
302991 | 25835 | 25835.0 | 2018-01-08 | 2021-06-03 | 20.0 |
302992 | 25835 | 25835.0 | 2018-04-03 | 2017-03-07 | 20.0 |
302993 | 25835 | 25835.0 | 2018-04-03 | 2017-08-03 | 20.0 |
302994 | 25835 | 25835.0 | 2018-04-03 | 2018-03-23 | 20.0 |
302995 | 25835 | 25835.0 | 2018-04-03 | 2019-06-28 | 20.0 |
302996 | 25835 | 25835.0 | 2018-04-03 | 2019-08-19 | 42.0 |
302997 | 25835 | 25835.0 | 2018-04-03 | 2020-04-20 | 40.0 |
302998 | 25835 | 25835.0 | 2018-04-03 | 2021-06-03 | 20.0 |
302999 | 25835 | 25835.0 | 2018-07-25 | 2017-03-07 | 20.0 |
303000 | 25835 | 25835.0 | 2018-07-25 | 2017-08-03 | 20.0 |
303001 | 25835 | 25835.0 | 2018-07-25 | 2018-03-23 | 20.0 |
303002 | 25835 | 25835.0 | 2018-07-25 | 2019-06-28 | 20.0 |
303003 | 25835 | 25835.0 | 2018-07-25 | 2019-08-19 | 42.0 |
303004 | 25835 | 25835.0 | 2018-07-25 | 2020-04-20 | 40.0 |
303005 | 25835 | 25835.0 | 2018-07-25 | 2021-06-03 | 20.0 |
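
For anyone who wants to reproduce the table above, here is a minimal sketch of how such a frame could be built. The names `encounters` and `tests` are hypothetical (the question only shows the joined result), and treating `RECORD_ID` as the index is an assumption. With a single client, merging on the client ID is equivalent to a cross join of 8 encounters and 7 tests, which gives the 56 rows shown.

```python
import pandas as pd

# Encounter dates for client 25835, copied from the table above.
# `encounters` is a hypothetical name; the original source table is not shown.
encounters = pd.DataFrame({
    "P_CLIENT_ID": 25835,
    "P_DATE_ENCOUNTER": pd.to_datetime([
        "2016-12-21", "2017-02-21", "2017-04-25", "2017-06-21",
        "2017-09-04", "2018-01-08", "2018-04-03", "2018-07-25",
    ]),
})

# Test history for the same client, copied from the table above.
# `tests` is likewise a hypothetical name.
tests = pd.DataFrame({
    "R_CLIENT_ID": 25835.0,
    "R_DATE_TESTED": pd.to_datetime([
        "2017-03-07", "2017-08-03", "2018-03-23", "2019-06-28",
        "2019-08-19", "2020-04-20", "2021-06-03",
    ]),
    "R_RESULT": [20.0, 20.0, 20.0, 20.0, 42.0, 40.0, 20.0],
})

# Pair every encounter with every test of the same client.
df = encounters.merge(tests, left_on="P_CLIENT_ID", right_on="R_CLIENT_ID")

# Assumption: RECORD_ID is a running index starting at 302950.
df.index = pd.RangeIndex(302950, 302950 + len(df), name="RECORD_ID")
```
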
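The data is already sorted. How can I get one unique record per client encounter (grouped by `P_CLIENT_ID` and `P_DATE_ENCOUNTER`) where `R_DATE_TESTED < P_DATE_ENCOUNTER`, taking the most recent such test? If no row satisfies `R_DATE_TESTED < P_DATE_ENCOUNTER`, the test columns should be null.
The expected result looks like this: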
P_CLIENT_ID | R_CLIENT_ID | P_DATE_ENCOUNTER | R_DATE_TESTED | R_RESULT |
---|---|---|---|---|
25835 | 25835.0 | 2016-12-21 | NaN | NaN |
25835 | 25835.0 | 2017-02-21 | NaN | NaN |
25835 | 25835.0 | 2017-04-25 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-06-21 | 2017-03-07 | 20.0 |
25835 | 25835.0 | 2017-09-04 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-01-08 | 2017-08-03 | 20.0 |
25835 | 25835.0 | 2018-04-03 | 2018-03-23 | 20.0 |
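The idea is that, for each `P_CLIENT_ID`, every encounter returns the closest preceding `R_RESULT` (the latest result before the encounter). If the client has no result before `P_DATE_ENCOUNTER`, that is, no `R_DATE_TESTED` is earlier than `P_DATE_ENCOUNTER`, those columns come back null (see the first two records). I thought some combination of ranking over a partition and `.ffill()` might work, but I'm really stuck.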
You can use this code:
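import numpy as np
import pandas as pd

# df - your DataFrame, sorted by R_DATE_TESTED within each encounter.
# Assumption: RECORD_ID is the index, so reset_index() yields a RECORD_ID column.
group = df.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])

def foo(g):
    # Keep tests dated strictly before the encounter; the last row is the most recent one.
    result = g.loc[g.P_DATE_ENCOUNTER > g.R_DATE_TESTED, ['R_DATE_TESTED', 'R_RESULT']].tail(1).reset_index()
    if not result.empty:
        return result
    else:
        # No earlier test for this encounter: return a single all-NaN row.
        return pd.DataFrame([[np.nan, np.nan, np.nan]], columns=['RECORD_ID', 'R_DATE_TESTED', 'R_RESULT'])

group.apply(foo)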
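
As an alternative to `groupby().apply()`, the same "latest test before each encounter" rule can be sketched with a filter, `drop_duplicates` and a left merge. This is only a sketch under assumptions: the sample frame below is a cut-down stand-in built from values in the question, the helper names (`encounters`, `tests`, `id_cols`, `latest_prior`) are made up for illustration, and the date columns are assumed to be real datetimes.

```python
import pandas as pd

# Cut-down stand-in for the cross-joined frame: 3 encounters x 3 tests.
# Values come from the question; the helper frames are hypothetical.
encounters = pd.DataFrame({
    "P_CLIENT_ID": 25835,
    "P_DATE_ENCOUNTER": pd.to_datetime(["2017-02-21", "2017-04-25", "2017-09-04"]),
})
tests = pd.DataFrame({
    "R_CLIENT_ID": 25835.0,
    "R_DATE_TESTED": pd.to_datetime(["2017-03-07", "2017-08-03", "2019-08-19"]),
    "R_RESULT": [20.0, 20.0, 42.0],
})
df = encounters.merge(tests, left_on="P_CLIENT_ID", right_on="R_CLIENT_ID")

# Columns that identify one encounter of one client.
id_cols = ["P_CLIENT_ID", "R_CLIENT_ID", "P_DATE_ENCOUNTER"]

# 1. Keep only tests dated strictly before the encounter.
# 2. Sort so the most recent qualifying test is last within each encounter.
# 3. Keep that last row per encounter.
latest_prior = (
    df[df["R_DATE_TESTED"] < df["P_DATE_ENCOUNTER"]]
    .sort_values(id_cols + ["R_DATE_TESTED"])
    .drop_duplicates(subset=id_cols, keep="last")
)

# Left-join back onto every distinct encounter, so encounters without an
# earlier test are kept with NaT/NaN in the test columns.
result = (
    df[id_cols]
    .drop_duplicates()
    .merge(latest_prior, on=id_cols, how="left")
    .reset_index(drop=True)
)
print(result)
```

The left merge is what keeps encounters such as 2017-02-21, which have no earlier test. A plain filter alone would drop them. If the encounters and tests are still available as separate tables, `pandas.merge_asof` with `direction="backward"` and `allow_exact_matches=False` may be worth a look as well, since it avoids building the cross join in the first place.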