How to calculate the number of days since a given event in each group
Here is a sample dataframe:
import pandas as pd

df = pd.DataFrame({'StudentName': ['Anil','Ramu','Ramu','Anil','Peter','Peter','Anil','Ramu','Peter','Anil'],
                   'ExamDate': ['2021-01-10','2021-01-20','2021-02-22','2021-03-30','2021-01-04','2021-06-06','2021-04-30','2021-07-30','2021-07-08','2021-09-07'],
                   'Result': ['Fail','Pass','Fail','Pass','Pass','Pass','Pass','Pass','Fail','Pass']})
StudentName ExamDate Result
0 Anil 2021-01-10 Fail
1 Ramu 2021-01-20 Pass
2 Ramu 2021-02-22 Fail
3 Anil 2021-03-30 Pass
4 Peter 2021-01-04 Pass
5 Peter 2021-06-06 Pass
6 Anil 2021-04-30 Pass
7 Ramu 2021-07-30 Pass
8 Peter 2021-07-08 Fail
9 Anil 2021-09-07 Pass
For each row, I want to compute the number of days since that student's last failed exam:
df = pd.DataFrame({'StudentName': ['Anil','Ramu','Ramu','Anil','Peter','Peter','Anil','Ramu','Peter','Anil'],
                   'ExamDate': ['2021-01-10','2021-01-20','2021-02-22','2021-03-30','2021-01-04','2021-06-06','2021-04-30','2021-07-30','2021-07-08','2021-09-07'],
                   'Result': ['Fail','Pass','Fail','Pass','Pass','Pass','Pass','Pass','Fail','Pass'],
                   'LastFailedDays': [0, 0, 0, 79, 0, 0, 110, 158, 0, 240]})
StudentName ExamDate Result LastFailedDays
0 Anil 2021-01-10 Fail 0
1 Ramu 2021-01-20 Pass 0
2 Ramu 2021-02-22 Fail 0
3 Anil 2021-03-30 Pass 79
4 Peter 2021-01-04 Pass 0
5 Peter 2021-06-06 Pass 0
6 Anil 2021-04-30 Pass 110
7 Ramu 2021-07-30 Pass 158
8 Peter 2021-07-08 Fail 0
9 Anil 2021-09-07 Pass 240
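For example:
- Anil failed on 2021-01-10, so that row is zero days.
- Anil's next record is a pass on 2021-03-30, so that row holds the number of days from his last failure date, 2021-01-10, to 2021-03-30, which is 79 days.
- Anil's third record, on 2021-04-30, is also a pass, so the count again runs from 2021-01-10 (his last failure date) to 2021-04-30, which is 110 days.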
A regular loop would work, but I am looking for a more idiomatic Pandas solution. I guess it is possible with groupby.
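For reference, a loop-based baseline of the kind mentioned above might look like the following sketch. It assumes the rows for each student already appear in chronological order, as they do in the sample, and treats "no failure on record yet" as 0 days. The names last_fail and loop_days are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentName': ['Anil', 'Ramu', 'Ramu', 'Anil', 'Peter', 'Peter', 'Anil', 'Ramu', 'Peter', 'Anil'],
    'ExamDate': ['2021-01-10', '2021-01-20', '2021-02-22', '2021-03-30', '2021-01-04',
                 '2021-06-06', '2021-04-30', '2021-07-30', '2021-07-08', '2021-09-07'],
    'Result': ['Fail', 'Pass', 'Fail', 'Pass', 'Pass', 'Pass', 'Pass', 'Pass', 'Fail', 'Pass'],
})
df['ExamDate'] = pd.to_datetime(df['ExamDate'])

last_fail = {}   # most recent failure date seen so far, keyed by student
loop_days = []
# Assumption: rows are in chronological order within each student.
for name, date, result in zip(df['StudentName'], df['ExamDate'], df['Result']):
    if result == 'Fail':
        last_fail[name] = date
    # 0 when the student has no failure on record yet
    loop_days.append((date - last_fail[name]).days if name in last_fail else 0)

df['LastFailedDays'] = loop_days
```

A baseline like this is mainly useful for checking the output of a vectorized version on small inputs.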
I finally came up with a working solution.
# Process the data a bit
df['Tmp_Result'] = df['Result'].map({'Pass': 1, 'Fail': 0})
df['ExamDate'] = pd.to_datetime(df['ExamDate'])
# Create a mask that will be used to group the rows by StudentName + consecutive passed tests after a failed test (including the failed test)
sorted_df = df.sort_values(['StudentName', 'ExamDate'])
mask = sorted_df.groupby('StudentName')['Tmp_Result'].diff().ne(0).cumsum()
mask[(sorted_df['Tmp_Result'].eq(0) & ~(pd.isna(sorted_df.groupby('StudentName')['Tmp_Result'].shift(-1))))] += 1
df['LastFailedDays'] = df.groupby(mask)['ExamDate'].diff().fillna(pd.Timedelta(0))
df['LastFailedDays'] = df.groupby(mask)['LastFailedDays'].cumsum()
# Cleanup
df = df.drop('Tmp_Result', axis=1)
Output:
>>> df
StudentName ExamDate Result LastFailedDays
0 Anil 2021-01-10 Fail 0 days
1 Ramu 2021-01-20 Pass 0 days
2 Ramu 2021-02-22 Fail 0 days
3 Anil 2021-03-30 Pass 79 days
4 Peter 2021-01-04 Pass 0 days
5 Peter 2021-06-06 Pass 153 days
6 Anil 2021-04-30 Pass 110 days
7 Ramu 2021-07-30 Pass 158 days
8 Peter 2021-07-08 Fail 0 days
9 Anil 2021-09-07 Pass 240 days
>>> df.sort_values(['StudentName', 'ExamDate'])
StudentName ExamDate Result LastFailedDays
0 Anil 2021-01-10 Fail 0 days
3 Anil 2021-03-30 Pass 79 days
6 Anil 2021-04-30 Pass 110 days
9 Anil 2021-09-07 Pass 240 days
4 Peter 2021-01-04 Pass 0 days
5 Peter 2021-06-06 Pass 153 days
8 Peter 2021-07-08 Fail 0 days
1 Ramu 2021-01-20 Pass 0 days
2 Ramu 2021-02-22 Fail 0 days
7 Ramu 2021-07-30 Pass 158 days
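It looks a bit scary, but because it is vectorized, it should be much faster than any loop-based solution. Note that Peter's 2021-06-06 row comes out as 153 days rather than the 0 in the expected output, because this approach measures from the first row of each run even when that run does not start with a failure.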
TL;DR
Use Series.where and groupby.ffill to generate each student's last failure date, then subtract it from ExamDate to get LastFailedDays:
df['ExamDate'] = pd.to_datetime(df['ExamDate'])
df['LastFailedDays'] = (df['ExamDate'].sub(
        df['ExamDate'].where(df['Result'] == 'Fail').groupby(df['StudentName']).ffill()
    ).dt.days.fillna(0))
# StudentName ExamDate Result LastFailedDays
# 0 Anil 2021-01-10 Fail 0.0
# 1 Ramu 2021-01-20 Pass 0.0
# 2 Ramu 2021-02-22 Fail 0.0
# 3 Anil 2021-03-30 Pass 79.0
# 4 Peter 2021-01-04 Pass 0.0
# 5 Peter 2021-06-06 Pass 0.0
# 6 Anil 2021-04-30 Pass 110.0
# 7 Ramu 2021-07-30 Pass 158.0
# 8 Peter 2021-07-08 Fail 0.0
# 9 Anil 2021-09-07 Pass 240.0
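The one-liner above can also be wrapped in a small helper. The sketch below is only an illustration: the function name days_since_last_fail and its parameters are hypothetical, and the final astype(int) is an addition that turns the float column (0.0, 79.0, ...) into whole numbers once the missing values have been filled.

```python
import pandas as pd

def days_since_last_fail(df, group_col='StudentName', date_col='ExamDate',
                         result_col='Result', fail_value='Fail'):
    """Hypothetical helper: days since the most recent failure within each group."""
    dates = pd.to_datetime(df[date_col])
    # Keep the date only on failed rows, then carry it forward within each group.
    last_fail = dates.where(df[result_col] == fail_value).groupby(df[group_col]).ffill()
    # Rows with no earlier failure give NaN days; map them to 0 and cast to int.
    return dates.sub(last_fail).dt.days.fillna(0).astype(int)
```

Usage would be df['LastFailedDays'] = days_since_last_fail(df); like the original, it relies on each student's rows being in chronological order.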
Re: the comments, to group by multiple columns, e.g. StudentClass and StudentName, use a list as the grouper:
...groupby([df['StudentClass'], df['StudentName']]).ffill()
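To make the multi-column case concrete, here is a minimal sketch on made-up data. The StudentClass values and the two same-named students are assumptions for illustration, not part of the original question.

```python
import pandas as pd

# Hypothetical data: two different students named Anil, in classes A and B.
df = pd.DataFrame({
    'StudentClass': ['A', 'B', 'A', 'B'],
    'StudentName': ['Anil', 'Anil', 'Anil', 'Anil'],
    'ExamDate': pd.to_datetime(['2021-01-10', '2021-01-15', '2021-01-20', '2021-01-25']),
    'Result': ['Fail', 'Pass', 'Pass', 'Pass'],
})

# Group by both columns so the failure in class A is not carried over to class B.
last_fail = (df['ExamDate'].where(df['Result'] == 'Fail')
             .groupby([df['StudentClass'], df['StudentName']]).ffill())
df['LastFailedDays'] = df['ExamDate'].sub(last_fail).dt.days.fillna(0).astype(int)
```

Grouping by StudentName alone would attribute the class A failure to the class B student as well, which is why the list grouper matters here.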
Details
Convert ExamDate with to_datetime:
df['ExamDate'] = pd.to_datetime(df['ExamDate'])
Use Series.where to keep the exam date only on failed rows, leaving NaT elsewhere (here I made it a column for easier visualization):
df['LastFailedDate'] = df['ExamDate'].where(df['Result'] == 'Fail')
# StudentName ExamDate Result LastFailedDate
# 0 Anil 2021-01-10 Fail 2021-01-10
# 1 Ramu 2021-01-20 Pass NaT
# 2 Ramu 2021-02-22 Fail 2021-02-22
# 3 Anil 2021-03-30 Pass NaT
# 4 Peter 2021-01-04 Pass NaT
# 5 Peter 2021-06-06 Pass NaT
# 6 Anil 2021-04-30 Pass NaT
# 7 Ramu 2021-07-30 Pass NaT
# 8 Peter 2021-07-08 Fail 2021-07-08
# 9 Anil 2021-09-07 Pass NaT
Use groupby.ffill to forward-fill each student's last failure date (NaT if there was no earlier failed exam):
df['LastFailedDate'] = df['LastFailedDate'].groupby(df['StudentName']).ffill()
# StudentName ExamDate Result LastFailedDate
# 0 Anil 2021-01-10 Fail 2021-01-10
# 1 Ramu 2021-01-20 Pass NaT
# 2 Ramu 2021-02-22 Fail 2021-02-22
# 3 Anil 2021-03-30 Pass 2021-01-10
# 4 Peter 2021-01-04 Pass NaT
# 5 Peter 2021-06-06 Pass NaT
# 6 Anil 2021-04-30 Pass 2021-01-10
# 7 Ramu 2021-07-30 Pass 2021-02-22
# 8 Peter 2021-07-08 Fail 2021-07-08
# 9 Anil 2021-09-07 Pass 2021-01-10
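Because ffill works in row order, the step above depends on each student's rows being chronological, which holds for the sample data. If that is not guaranteed, one option is to sort first and let index alignment restore the original order. The two-row frame below is a made-up example.

```python
import pandas as pd

# Hypothetical unsorted data: Anil's pass row is listed before his earlier fail row.
df = pd.DataFrame({
    'StudentName': ['Anil', 'Anil'],
    'ExamDate': pd.to_datetime(['2021-03-30', '2021-01-10']),
    'Result': ['Pass', 'Fail'],
})

ordered = df.sort_values(['StudentName', 'ExamDate'])
last_fail = (ordered['ExamDate'].where(ordered['Result'] == 'Fail')
             .groupby(ordered['StudentName']).ffill())
# Assignment aligns on the index, so df keeps its original row order.
df['LastFailedDate'] = last_fail
```

Without the sort, the pass row would come before the failure in row order and would stay NaT.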
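Finally, subtract the last failure date from the exam date, use dt.days to extract the number of days, and fill the rows that have no earlier failure with 0: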
df['LastFailedDays'] = df['ExamDate'].sub(df['LastFailedDate']).dt.days.fillna(0)
# StudentName ExamDate Result LastFailedDate LastFailedDays
# 0 Anil 2021-01-10 Fail 2021-01-10 0.0
# 1 Ramu 2021-01-20 Pass NaT 0.0
# 2 Ramu 2021-02-22 Fail 2021-02-22 0.0
# 3 Anil 2021-03-30 Pass 2021-01-10 79.0
# 4 Peter 2021-01-04 Pass NaT 0.0
# 5 Peter 2021-06-06 Pass NaT 0.0
# 6 Anil 2021-04-30 Pass 2021-01-10 110.0
# 7 Ramu 2021-07-30 Pass 2021-02-22 158.0
# 8 Peter 2021-07-08 Fail 2021-07-08 0.0
# 9 Anil 2021-09-07 Pass 2021-01-10 240.0
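Filling with 0 makes "failed on this date" and "never failed before" look the same. If that distinction matters, a possible variation (not part of the original answer) is to skip fillna and cast to pandas' nullable Int64 dtype, which keeps the gaps as <NA>. The sketch uses only Peter's rows from the sample.

```python
import pandas as pd

df = pd.DataFrame({
    'StudentName': ['Peter', 'Peter', 'Peter'],
    'ExamDate': pd.to_datetime(['2021-01-04', '2021-06-06', '2021-07-08']),
    'Result': ['Pass', 'Pass', 'Fail'],
})

last_fail = df['ExamDate'].where(df['Result'] == 'Fail').groupby(df['StudentName']).ffill()
# Nullable integer dtype: <NA> where there is no earlier failure, whole numbers elsewhere.
df['LastFailedDays'] = df['ExamDate'].sub(last_fail).dt.days.astype('Int64')
```

Here the first two rows stay missing and only the failed row gets 0.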