How can I determine, using pandas, how many consecutive years each individual has been tested?

Question

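How can I tell whether a group of individuals meets a condition of being tested in consecutive years, counted from the date of each person's first test? A sample dataset is shown below. I am thinking of using dataset 1 as df1 and dataset 2 as df2, but I am not sure how to subtract each ID's first collection date from the other collection dates of that same ID.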

Dataset 1:

ID	Date of Collection	Test Done
My-ID 00001	10/05/2016	1A
My-ID 00001	10/01/2017	1A
My-ID 00001	23/01/2018	1A
My-ID 00001	18/04/2019	1A
My-ID 00001	30/04/2020	1A
My-ID 00002	30/09/2015	1A
My-ID 00002	31/05/2016	1A
My-ID 00002	31/05/2017	1A
My-ID 00003	31/05/2017	1A

Dataset 2:

ID	Test Done	Result	Date of Collection
My-ID 00001	1A	50	10/05/2016
My-ID 00002	1A	75	30/09/2015

Expected result:

ID	Date of Collection	Test Done	Year since first collection date
My-ID 00001	10/05/2016	1A	0
My-ID 00001	10/01/2017	1A	1
My-ID 00001	23/01/2018	1A	2
My-ID 00001	18/04/2019	1A	3
My-ID 00001	30/04/2020	1A	4
My-ID 00002	30/09/2015	1A	0
My-ID 00002	31/05/2016	1A	1

Answer 1

If it is enough to work from df1 alone, subtract each group's first year from every year in that group, using GroupBy.transform with GroupBy.first:

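import pandas as pd

# Parse the day-first date strings (10/05/2016 is 10 May 2016)
df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)

# Calendar year of each test minus the year of the first row in the same ID group
y = df1['Date of Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df1['ID']).transform('first'))
print(df1)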
            ID Date of Collection Test Done  Year since first collection date
0  My-ID 00001         2016-05-10        1A                                 0
1  My-ID 00001         2017-01-10        1A                                 1
2  My-ID 00001         2018-01-23        1A                                 2
3  My-ID 00001         2019-04-18        1A                                 3
4  My-ID 00001         2020-04-30        1A                                 4
5  My-ID 00002         2015-09-30        1A                                 0
6  My-ID 00002         2016-05-31        1A                                 1
7  My-ID 00003         2017-05-31        1A                                 0
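
The column above counts calendar years since the first test, but it does not by itself say whether the years are consecutive. A minimal sketch of one way to check that per ID is given below, assuming "consecutive" means no calendar year is skipped between the first and last test. The names `summary`, `min_years`, and `qualifying` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from dataset 1 in the question
df1 = pd.DataFrame({
    "ID": ["My-ID 00001"] * 5 + ["My-ID 00002"] * 3 + ["My-ID 00003"],
    "Date of Collection": [
        "10/05/2016", "10/01/2017", "23/01/2018", "18/04/2019", "30/04/2020",
        "30/09/2015", "31/05/2016", "31/05/2017",
        "31/05/2017",
    ],
    "Test Done": ["1A"] * 9,
})
df1["Date of Collection"] = pd.to_datetime(df1["Date of Collection"], dayfirst=True)

# Group the calendar year of each test by ID
year = df1["Date of Collection"].dt.year
g = year.groupby(df1["ID"])

summary = pd.DataFrame({
    "first_year": g.min(),
    "last_year": g.max(),
    "years_tested": g.nunique(),  # distinct calendar years with a test
})

# No year is skipped exactly when the span of years equals the number of distinct years
summary["consecutive"] = (
    summary["last_year"] - summary["first_year"] + 1
) == summary["years_tested"]

# Hypothetical threshold: IDs tested in at least 3 consecutive years
min_years = 3
qualifying = summary[
    summary["consecutive"] & (summary["years_tested"] >= min_years)
].index.tolist()
print(summary)
print(qualifying)
```

Comparing the year span with the count of distinct years tolerates several tests in the same year, which a plain row-to-row difference would not.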

If the first collection date has to come from df2 instead, add a left join with DataFrame.merge before the subtraction:

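df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
# Assumes df2's 'Date of Collection' is renamed first, so it does not clash with df1's column
df2 = df2.rename(columns={'Date of Collection': 'Date of First Collection'})
df2['Date of First Collection'] = pd.to_datetime(df2['Date of First Collection'], dayfirst=True)

# Left join keeps every df1 row; IDs missing from df2 get NaN in the new column
m = df1.merge(df2, on=['ID', 'Test Done'], how='left')
df1['Year since first collection date'] = (m['Date of Collection'].dt.year
    .sub(m['Date of First Collection'].dt.year).to_numpy())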
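
For reference, here is a self-contained sketch of the df2-based variant run on the sample data from the question. It assumes that rows whose ID has no entry in df2 should be dropped, which is how the expected result in the question appears to treat My-ID 00003. The names `first`, `merged`, and `result` are illustrative.

```python
import pandas as pd

# Sample rows copied from dataset 1 and dataset 2 in the question
df1 = pd.DataFrame({
    "ID": ["My-ID 00001"] * 5 + ["My-ID 00002"] * 3 + ["My-ID 00003"],
    "Date of Collection": [
        "10/05/2016", "10/01/2017", "23/01/2018", "18/04/2019", "30/04/2020",
        "30/09/2015", "31/05/2016", "31/05/2017",
        "31/05/2017",
    ],
    "Test Done": ["1A"] * 9,
})
df2 = pd.DataFrame({
    "ID": ["My-ID 00001", "My-ID 00002"],
    "Test Done": ["1A", "1A"],
    "Result": [50, 75],
    "Date of Collection": ["10/05/2016", "30/09/2015"],
})

df1["Date of Collection"] = pd.to_datetime(df1["Date of Collection"], dayfirst=True)

# Rename df2's date column so it does not collide with df1's after the merge
first = df2.rename(columns={"Date of Collection": "Date of First Collection"})
first["Date of First Collection"] = pd.to_datetime(
    first["Date of First Collection"], dayfirst=True
)

# Left join: every df1 row is kept, unmatched IDs get NaT as first date
merged = df1.merge(
    first[["ID", "Test Done", "Date of First Collection"]],
    on=["ID", "Test Done"],
    how="left",
)

# Assumption: drop IDs that have no first collection date in df2
result = merged.dropna(subset=["Date of First Collection"]).copy()
result["Year since first collection date"] = (
    result["Date of Collection"].dt.year
    - result["Date of First Collection"].dt.year
)
result = result.drop(columns="Date of First Collection")
print(result)
```

If the unmatched IDs should be kept instead, skip the `dropna` step and the new column will hold NaN for them.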

python

numpy

pandas

jupyter

jupyter-notebook