How can I determine, using pandas, for how many consecutive years a test was done for each individual?

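Starting from the date of each person's first test, how can I determine whether a group of individuals meets the condition of being tested in consecutive years? Sample datasets are below. I was thinking of using Dataset 1 as df1 and Dataset 2 as df2, but I am not sure how to subtract each ID's first collection date from the other collection dates for that same ID.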

Dataset 1:

ID           Date of Collection  Test Done
My-ID 00001  10/05/2016          1A
My-ID 00001  10/01/2017          1A
My-ID 00001  23/01/2018          1A
My-ID 00001  18/04/2019          1A
My-ID 00001  30/04/2020          1A
My-ID 00002  30/09/2015          1A
My-ID 00002  31/05/2016          1A
My-ID 00002  31/05/2017          1A
My-ID 00003  31/05/2017          1A

Dataset 2:

ID           Test Done  Result  Date of Collection
My-ID 00001  1A         50      10/05/2016
My-ID 00002  1A         75      30/09/2015

Expected result:

ID           Date of Collection  Test Done  Year since first collection date
My-ID 00001  10/05/2016          1A         0
My-ID 00001  10/01/2017          1A         1
My-ID 00001  23/01/2018          1A         2
My-ID 00001  18/04/2019          1A         3
My-ID 00001  30/04/2020          1A         4
My-ID 00002  30/09/2015          1A         0
My-ID 00002  31/05/2016          1A         1

If working with calendar years is enough, take the year of each date and subtract the first year of each group, using GroupBy.transform with GroupBy.first:

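import pandas as pd

# Parse the day-first date strings into datetimes
df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)

# Year of each collection minus the year of the first row for the same ID
y = df1['Date of Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df1['ID']).transform('first'))
print(df1)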
            ID Date of Collection Test Done  Year since first collection date
0  My-ID 00001         2016-05-10        1A                                 0
1  My-ID 00001         2017-01-10        1A                                 1
2  My-ID 00001         2018-01-23        1A                                 2
3  My-ID 00001         2019-04-18        1A                                 3
4  My-ID 00001         2020-04-30        1A                                 4
5  My-ID 00002         2015-09-30        1A                                 0
6  My-ID 00002         2016-05-31        1A                                 1
7  My-ID 00002         2017-05-31        1A                                 2
8  My-ID 00003         2017-05-31        1A                                 0
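
The approach above can be tried end to end with the sample data. The following is a minimal sketch, not a definitive implementation: it builds df1 by hand from Dataset 1, sorts by date so that 'first' really is the earliest collection per ID, and then adds a per-ID check for consecutive years. The names `gap` and `consecutive` are hypothetical, and "consecutive" is assumed here to mean that each collection year is exactly one more than the previous one for the same ID.

```python
import pandas as pd

# Dataset 1 from the question, typed in by hand
df1 = pd.DataFrame({
    "ID": ["My-ID 00001"] * 5 + ["My-ID 00002"] * 3 + ["My-ID 00003"],
    "Date of Collection": [
        "10/05/2016", "10/01/2017", "23/01/2018", "18/04/2019", "30/04/2020",
        "30/09/2015", "31/05/2016", "31/05/2017",
        "31/05/2017",
    ],
    "Test Done": ["1A"] * 9,
})
df1["Date of Collection"] = pd.to_datetime(df1["Date of Collection"], dayfirst=True)

# transform('first') takes the first row per group, so sort by date first
df1 = df1.sort_values(["ID", "Date of Collection"]).reset_index(drop=True)

year = df1["Date of Collection"].dt.year
df1["Year since first collection date"] = year - year.groupby(df1["ID"]).transform("first")

# Assumption: "consecutive" means every year-to-year gap within an ID equals 1.
# diff() yields NaN on the first row of each ID, which is treated as acceptable.
gap = year.groupby(df1["ID"]).diff()
consecutive = gap.fillna(1).eq(1).groupby(df1["ID"]).all()

print(df1)
print(consecutive)
```

Under this definition an ID with a single collection counts as consecutive; filter on the group size as well if that is not what you want.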

If the first collection date has to come from df2, add a left join with DataFrame.merge before the subtraction:

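df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
df2['Date of Collection'] = pd.to_datetime(df2['Date of Collection'], dayfirst=True)

# Rename df2's date column so it does not collide with df1's column in the merge
first = df2.rename(columns={'Date of Collection': 'Date of First Collection'})
y = df1.merge(first, on=['ID','Test Done'], how='left')['Date of First Collection'].dt.year
# Assumes df1 has the default RangeIndex so the merged rows align with df1
df1['Year since first collection date'] = df1['Date of Collection'].dt.year.sub(y)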
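
For reference, here is a self-contained sketch of the df2 variant using the sample data from the question. It is one possible reading of the requirement: df2's 'Date of Collection' is assumed to be the first collection date, and the names `first`, `merged` and `result` are hypothetical. IDs that are missing from df2 (here My-ID 00003) get a missing value after the left join, so the result is cast to the nullable Int64 dtype.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["My-ID 00001"] * 5 + ["My-ID 00002"] * 3 + ["My-ID 00003"],
    "Date of Collection": [
        "10/05/2016", "10/01/2017", "23/01/2018", "18/04/2019", "30/04/2020",
        "30/09/2015", "31/05/2016", "31/05/2017",
        "31/05/2017",
    ],
    "Test Done": ["1A"] * 9,
})
df2 = pd.DataFrame({
    "ID": ["My-ID 00001", "My-ID 00002"],
    "Test Done": ["1A", "1A"],
    "Result": [50, 75],
    "Date of Collection": ["10/05/2016", "30/09/2015"],
})
df1["Date of Collection"] = pd.to_datetime(df1["Date of Collection"], dayfirst=True)
df2["Date of Collection"] = pd.to_datetime(df2["Date of Collection"], dayfirst=True)

# Keep only the join keys and the date, renamed to avoid a column clash
first = df2[["ID", "Test Done", "Date of Collection"]].rename(
    columns={"Date of Collection": "Date of First Collection"}
)
merged = df1.merge(first, on=["ID", "Test Done"], how="left")

# Year difference against the first collection taken from df2
merged["Year since first collection date"] = (
    merged["Date of Collection"].dt.year - merged["Date of First Collection"].dt.year
).astype("Int64")

# Optionally drop IDs that have no first collection in df2
result = merged.dropna(subset=["Date of First Collection"])
print(result)
```

Working on the merged frame instead of assigning back to df1 avoids relying on index alignment between the two.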