How to determine how many consecutive years a test was done for each individual using pandas?
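How can I determine, from the date of the first test, whether the individuals in a group were tested in consecutive years? A sample dataset is below. I am thinking of using dataset 1 as df1 and dataset 2 as df2, but I am not sure how to subtract each ID's first collection date from the other collection dates of that same ID.
Dataset 1: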
ID | Date of Collection | Test Done |
---|---|---|
My-ID 00001 | 10/05/2016 | 1A |
My-ID 00001 | 10/01/2017 | 1A |
My-ID 00001 | 23/01/2018 | 1A |
My-ID 00001 | 18/04/2019 | 1A |
My-ID 00001 | 30/04/2020 | 1A |
My-ID 00002 | 30/09/2015 | 1A |
My-ID 00002 | 31/05/2016 | 1A |
My-ID 00002 | 31/05/2017 | 1A |
My-ID 00003 | 31/05/2017 | 1A |
Dataset 2:
ID | Test Done | Result | Date of Collection |
---|---|---|---|
My-ID 00001 | 1A | 50 | 10/05/2016 |
My-ID 00002 | 1A | 75 | 30/09/2015 |
Desired result:
ID | Date of Collection | Test Done | Year since first collection date |
---|---|---|---|
My-ID 00001 | 10/05/2016 | 1A | 0 |
My-ID 00001 | 10/01/2017 | 1A | 1 |
My-ID 00001 | 23/01/2018 | 1A | 2 |
My-ID 00001 | 18/04/2019 | 1A | 3 |
My-ID 00001 | 30/04/2020 | 1A | 4 |
My-ID 00002 | 30/09/2015 | 1A | 0 |
My-ID 00002 | 31/05/2016 | 1A | 1 |
If it is enough to take the first year within each group, subtract each group's first value from the year, using GroupBy.transform with GroupBy.first:
import pandas as pd

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
y = df1['Date of Collection'].dt.year
# subtract each ID's first year (rows are assumed sorted by date within each ID)
df1['Year since first collection date'] = y.sub(y.groupby(df1['ID']).transform('first'))
print(df1)
ID Date of Collection Test Done Year since first collection date
0 My-ID 00001 2016-05-10 1A 0
1 My-ID 00001 2017-01-10 1A 1
2 My-ID 00001 2018-01-23 1A 2
3 My-ID 00001 2019-04-18 1A 3
4 My-ID 00001 2020-04-30 1A 4
5 My-ID 00002 2015-09-30 1A 0
6 My-ID 00002 2016-05-31 1A 1
7 My-ID 00003 2017-05-31 1A 0
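To try this end to end, here is a minimal, self-contained sketch that rebuilds dataset 1 from the question, computes the year offset, and then adds a per-ID check for consecutive years. The `Consecutive` column name is made up for illustration, and reading "consecutive" as "each test year is exactly one more than the previous test year for that ID" is an assumption, not something the question states.

```python
import pandas as pd

# Dataset 1 from the question
df1 = pd.DataFrame({
    'ID': ['My-ID 00001'] * 5 + ['My-ID 00002'] * 3 + ['My-ID 00003'],
    'Date of Collection': ['10/05/2016', '10/01/2017', '23/01/2018',
                           '18/04/2019', '30/04/2020', '30/09/2015',
                           '31/05/2016', '31/05/2017', '31/05/2017'],
    'Test Done': ['1A'] * 9,
})

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
# Sort so that 'first' really is the earliest collection for each ID
df1 = df1.sort_values(['ID', 'Date of Collection']).reset_index(drop=True)

y = df1['Date of Collection'].dt.year
df1['Year since first collection date'] = y.sub(y.groupby(df1['ID']).transform('first'))

# Assumption: "consecutive" means every gap between successive test years is 1.
# diff() is NaN on each ID's first row, so fill it with 1 before comparing.
gaps = y.groupby(df1['ID']).diff().fillna(1)
df1['Consecutive'] = gaps.eq(1).groupby(df1['ID']).transform('all')
print(df1)
```

An ID with a skipped calendar year, or with two tests in the same year, would get `False` in every one of its rows under this definition.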
If the first collection date has to come from df2, first bring it into df1 with a left join via DataFrame.merge, then subtract the years:
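df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)
# assumes df2's date column is named 'Date of First Collection' (rename it if necessary)
df2['Date of First Collection'] = pd.to_datetime(df2['Date of First Collection'], dayfirst=True)
# left join keeps every df1 row; the result lines up with df1 only if df1 has a default RangeIndex
first = df1.merge(df2, on=['ID','Test Done'], how='left')['Date of First Collection'].dt.year
# year of each collection minus the year of the first collection taken from df2
df1['Year since first collection date'] = df1['Date of Collection'].dt.year.sub(first)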
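As a runnable sketch of the left-join variant, the snippet below rebuilds both datasets from the question and assumes df2 holds each ID's first collection date in its `Date of Collection` column, as dataset 2 does. The renamed column `Date of First Collection` and the names `first`, `merged`, and `result` are illustrative choices. The offset is taken as the row's collection year minus the first-collection year from df2, and IDs that do not appear in df2 are dropped, which is one possible reading of the desired result.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['My-ID 00001'] * 5 + ['My-ID 00002'] * 3 + ['My-ID 00003'],
    'Date of Collection': ['10/05/2016', '10/01/2017', '23/01/2018',
                           '18/04/2019', '30/04/2020', '30/09/2015',
                           '31/05/2016', '31/05/2017', '31/05/2017'],
    'Test Done': ['1A'] * 9,
})
df2 = pd.DataFrame({
    'ID': ['My-ID 00001', 'My-ID 00002'],
    'Test Done': ['1A', '1A'],
    'Result': [50, 75],
    'Date of Collection': ['10/05/2016', '30/09/2015'],
})

df1['Date of Collection'] = pd.to_datetime(df1['Date of Collection'], dayfirst=True)

# Rename df2's date column so the merged frame has two distinct date columns
first = (df2.rename(columns={'Date of Collection': 'Date of First Collection'})
            [['ID', 'Test Done', 'Date of First Collection']].copy())
first['Date of First Collection'] = pd.to_datetime(
    first['Date of First Collection'], dayfirst=True)

# Left join: every df1 row is kept; IDs missing from df2 get NaT
merged = df1.merge(first, on=['ID', 'Test Done'], how='left')
merged['Year since first collection date'] = (
    merged['Date of Collection'].dt.year
    - merged['Date of First Collection'].dt.year
)

# Drop IDs with no first date in df2, then cast the offset back to int
result = merged.dropna(subset=['Date of First Collection']).copy()
result['Year since first collection date'] = (
    result['Year since first collection date'].astype(int))
print(result)
```

Keeping the unmatched rows instead only requires skipping the `dropna` step, in which case the offset column stays as floats with `NaN` for those IDs.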