How to get a unique record per group from a cross-joined table based on the dates in two different columns?

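I need to build some fairly complex logic. I have client clinic encounter data that includes historical test results: R_DATE_TESTED and R_RESULT are mapped to every P_DATE_ENCOUNTER of each client (P_CLIENT_ID):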
RECORD_ID P_CLIENT_ID R_CLIENT_ID P_DATE_ENCOUNTER R_DATE_TESTED R_RESULT
302950 25835 25835.0 2016-12-21 2017-03-07 20.0
302951 25835 25835.0 2016-12-21 2017-08-03 20.0
302952 25835 25835.0 2016-12-21 2018-03-23 20.0
302953 25835 25835.0 2016-12-21 2019-06-28 20.0
302954 25835 25835.0 2016-12-21 2019-08-19 42.0
302955 25835 25835.0 2016-12-21 2020-04-20 40.0
302956 25835 25835.0 2016-12-21 2021-06-03 20.0
302957 25835 25835.0 2017-02-21 2017-03-07 20.0
302958 25835 25835.0 2017-02-21 2017-08-03 20.0
302959 25835 25835.0 2017-02-21 2018-03-23 20.0
302960 25835 25835.0 2017-02-21 2019-06-28 20.0
302961 25835 25835.0 2017-02-21 2019-08-19 42.0
302962 25835 25835.0 2017-02-21 2020-04-20 40.0
302963 25835 25835.0 2017-02-21 2021-06-03 20.0
302964 25835 25835.0 2017-04-25 2017-03-07 20.0
302965 25835 25835.0 2017-04-25 2017-08-03 20.0
302966 25835 25835.0 2017-04-25 2018-03-23 20.0
302967 25835 25835.0 2017-04-25 2019-06-28 20.0
302968 25835 25835.0 2017-04-25 2019-08-19 42.0
302969 25835 25835.0 2017-04-25 2020-04-20 40.0
302970 25835 25835.0 2017-04-25 2021-06-03 20.0
302971 25835 25835.0 2017-06-21 2017-03-07 20.0
302972 25835 25835.0 2017-06-21 2017-08-03 20.0
302973 25835 25835.0 2017-06-21 2018-03-23 20.0
302974 25835 25835.0 2017-06-21 2019-06-28 20.0
302975 25835 25835.0 2017-06-21 2019-08-19 42.0
302976 25835 25835.0 2017-06-21 2020-04-20 40.0
302977 25835 25835.0 2017-06-21 2021-06-03 20.0
302978 25835 25835.0 2017-09-04 2017-03-07 20.0
302979 25835 25835.0 2017-09-04 2017-08-03 20.0
302980 25835 25835.0 2017-09-04 2018-03-23 20.0
302981 25835 25835.0 2017-09-04 2019-06-28 20.0
302982 25835 25835.0 2017-09-04 2019-08-19 42.0
302983 25835 25835.0 2017-09-04 2020-04-20 40.0
302984 25835 25835.0 2017-09-04 2021-06-03 20.0
302985 25835 25835.0 2018-01-08 2017-03-07 20.0
302986 25835 25835.0 2018-01-08 2017-08-03 20.0
302987 25835 25835.0 2018-01-08 2018-03-23 20.0
302988 25835 25835.0 2018-01-08 2019-06-28 20.0
302989 25835 25835.0 2018-01-08 2019-08-19 42.0
302990 25835 25835.0 2018-01-08 2020-04-20 40.0
302991 25835 25835.0 2018-01-08 2021-06-03 20.0
302992 25835 25835.0 2018-04-03 2017-03-07 20.0
302993 25835 25835.0 2018-04-03 2017-08-03 20.0
302994 25835 25835.0 2018-04-03 2018-03-23 20.0
302995 25835 25835.0 2018-04-03 2019-06-28 20.0
302996 25835 25835.0 2018-04-03 2019-08-19 42.0
302997 25835 25835.0 2018-04-03 2020-04-20 40.0
302998 25835 25835.0 2018-04-03 2021-06-03 20.0
302999 25835 25835.0 2018-07-25 2017-03-07 20.0
303000 25835 25835.0 2018-07-25 2017-08-03 20.0
303001 25835 25835.0 2018-07-25 2018-03-23 20.0
303002 25835 25835.0 2018-07-25 2019-06-28 20.0
303003 25835 25835.0 2018-07-25 2019-08-19 42.0
303004 25835 25835.0 2018-07-25 2020-04-20 40.0
303005 25835 25835.0 2018-07-25 2021-06-03 20.0

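The data is already sorted. How can I get a single record per client encounter (grouped by P_CLIENT_ID and P_DATE_ENCOUNTER) where R_DATE_TESTED < P_DATE_ENCOUNTER, keeping only the most recent such test? In addition, if no row satisfies R_DATE_TESTED < P_DATE_ENCOUNTER, the test columns should come back as null.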

The expected result should look like this:

P_CLIENT_ID R_CLIENT_ID P_DATE_ENCOUNTER R_DATE_TESTED R_RESULT
25835 25835.0 2016-12-21 NaN NaN
25835 25835.0 2017-02-21 NaN NaN
25835 25835.0 2017-04-25 2017-03-07 20.0
25835 25835.0 2017-06-21 2017-03-07 20.0
25835 25835.0 2017-09-04 2017-08-03 20.0
25835 25835.0 2018-01-08 2017-08-03 20.0
25835 25835.0 2018-04-03 2018-03-23 20.0

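The idea is that, for each P_CLIENT_ID, every encounter (P_DATE_ENCOUNTER) returns the closest preceding R_RESULT, that is, the latest result before the encounter. If the client has no result before P_DATE_ENCOUNTER (no row with R_DATE_TESTED < P_DATE_ENCOUNTER), those columns should be null, as in the first two records. I thought some combination of ranking over a partition and .ffill() might work, but I'm really stuck.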

You can use this code:

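import numpy as np
import pandas as pd

# df - your DataFrame; this assumes RECORD_ID is the index and that rows
# are sorted by R_DATE_TESTED within each encounter, as in the question

group = df.groupby(['P_CLIENT_ID', 'P_DATE_ENCOUNTER'])

def latest_prior_test(g):
    # Keep only tests taken before the encounter; the last one is the most recent.
    result = g.loc[g.P_DATE_ENCOUNTER > g.R_DATE_TESTED, ['R_DATE_TESTED', 'R_RESULT']].tail(1).reset_index()
    if not result.empty:
        return result
    # No earlier test for this encounter: return one all-NaN row with the same columns.
    return pd.DataFrame([[np.nan, np.nan, np.nan]], columns=['RECORD_ID', 'R_DATE_TESTED', 'R_RESULT'])


group.apply(latest_prior_test)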
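
If the per-group function gets slow on a large table, the same logic can probably be expressed without apply. The following is only a sketch under a few assumptions: the sample frame is a small hand-picked subset of the rows in the question, both date columns are converted to datetimes first, and the variable names (keys, prior, latest, encounters) are made up for illustration. It filters to tests taken strictly before the encounter, keeps the latest one per encounter, and left-joins back onto the list of encounters so that encounters with no earlier test still appear with empty values.

```python
import pandas as pd

# Hypothetical sample: a subset of the question's rows (3 encounters x 2 tests).
df = pd.DataFrame({
    "P_CLIENT_ID": [25835] * 6,
    "R_CLIENT_ID": [25835.0] * 6,
    "P_DATE_ENCOUNTER": ["2016-12-21"] * 2 + ["2017-04-25"] * 2 + ["2017-09-04"] * 2,
    "R_DATE_TESTED": ["2017-03-07", "2017-08-03"] * 3,
    "R_RESULT": [20.0, 20.0] * 3,
})

# Compare real datetimes rather than strings.
for col in ["P_DATE_ENCOUNTER", "R_DATE_TESTED"]:
    df[col] = pd.to_datetime(df[col])

keys = ["P_CLIENT_ID", "P_DATE_ENCOUNTER"]

# 1. Keep only tests taken strictly before the encounter.
prior = df[df["R_DATE_TESTED"] < df["P_DATE_ENCOUNTER"]]

# 2. For each encounter, keep the most recent of those tests.
latest = (
    prior.sort_values("R_DATE_TESTED")
         .drop_duplicates(subset=keys, keep="last")
         [keys + ["R_DATE_TESTED", "R_RESULT"]]
)

# 3. One row per encounter, including encounters with no earlier test.
encounters = df[["P_CLIENT_ID", "R_CLIENT_ID", "P_DATE_ENCOUNTER"]].drop_duplicates()

# 4. The left join leaves NaT/NaN where no earlier test exists.
result = encounters.merge(latest, on=keys, how="left")
print(result)
```

On this subset the 2016-12-21 encounter comes back with empty test columns, while 2017-04-25 and 2017-09-04 pick up the 2017-03-07 and 2017-08-03 tests, which agrees with the expected output in the question. Sorting explicitly by R_DATE_TESTED means the sketch does not rely on the input already being ordered.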