Filter Dataframe Based on Local Minima with Increasing Timeline
Edited:
I have the following dataframe of students with their exam scores on different dates (sorted):
import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A B B B B C C'.split(),
                   'exam_date': [datetime.datetime(2013,4,1), datetime.datetime(2013,6,1),
                                 datetime.datetime(2013,7,1), datetime.datetime(2013,9,2),
                                 datetime.datetime(2013,10,1), datetime.datetime(2013,11,2),
                                 datetime.datetime(2014,2,2), datetime.datetime(2013,7,1),
                                 datetime.datetime(2013,9,2)],
                   'score': [15, 17, 32, 22, 28, 24, 33, 33, 15]})
print(df)
student exam_date score
0 A 2013-04-01 15
1 A 2013-06-01 17
2 A 2013-07-01 32
3 B 2013-09-02 22
4 B 2013-10-01 28
5 B 2013-11-02 24
6 B 2014-02-02 33
7 C 2013-07-01 33
8 C 2013-09-02 15
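I need to keep only those rows where the score has increased by more than 10 from the local minimum.
For example, for student A, the local minimum is 15 and the score then increases to 32, so we keep that row.
For student B, no score increases by more than 10 from a local minimum: 28-22 and 33-24 are both less than 10.
For student C, the local minimum is 15, but the score does not increase after that, so we drop it.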
I am trying the following script:
out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]
print(out)
2 A 2013-07-01 32
6 B 2014-02-02 33 #--Shouldn't capture this as it's increased by `9` from local minima of `24`
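To see why row 6 slips through, it may help to look at student B alone. The sketch below is only an illustration (the standalone series `scores_b` is rebuilt here from the data above and is not part of the original script): `cummin` tracks the running minimum since the first exam, so it never moves up to the later dip at 24.

```python
import pandas as pd

# Student B's scores, copied from the dataframe above (illustrative only)
scores_b = pd.Series([22, 28, 24, 33], index=[3, 4, 5, 6], name="score")

# Running minimum since the first exam: it stays at 22 and ignores the dip to 24
running_min = scores_b.cummin()
print(running_min.tolist())               # [22, 22, 22, 22]

# Difference from the running minimum: the last row is 33 - 22 = 11
print((scores_b - running_min).tolist())  # [0, 6, 2, 11]
```

The last difference is 11, which passes the `>= 10` test, whereas measured from the most recent local minimum (24) the rise is only 9.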
Expected output:
student exam_date score
2 A 2013-07-01 32
# For A, score of 32 is increased by 17 from local minima of 15
What is the smartest way of doing this? Any suggestions would be appreciated. Thanks!
Combining your post with @Corralien's solution, I came up with a one-liner that works well:
filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()
Output:
>>> filtered
student exam_date score
0 A 2013-06-01 27.0
1 B 2013-10-01 43.0
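We can try the following approach:
Use groupby + diff to find the difference between consecutive scores for each student.
Use where to assign NaN to all rows where the score difference is less than 10.
Use groupby + first to get the first score difference greater than 10 for each student.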
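# consecutive score differences within each student (the groupby + diff step described above)
diff = df.groupby('student')['score'].diff()
msk = (diff>10) | (diff.groupby([diff[::-1].shift().lt(0).cumsum()[::-1], df['student']]).cumsum()>10)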
out = df.where(msk).groupby('student').first().reset_index()
Output:
student exam_date score
0 A 2013-06-01 27.0
1 B 2013-10-01 43.0
Assuming your dataframe is already sorted by date:
highest_score = lambda x: x['score'] - x['score'].mask(x['score'].gt(x['score'].shift())).ffill() > 10
out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)
# Output
student exam_date score
2 A 2013-07-01 32
Focus on the lambda function
Let's modify your dataframe and extract a single student to avoid the groupby:
>>> df = df[df['student'] == 'B']
student exam_date score
3 B 2013-09-02 22
4 B 2013-10-01 28
5 B 2013-11-02 24
6 B 2014-02-02 33
# Step-1: find rows where the value is not a local minimum
>>> df['score'].gt(df['score'].shift())
3 False
4 True
5 False
6 True
Name: score, dtype: bool
# Step-2: hide non local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3 22.0
4 NaN
5 24.0
6 NaN
Name: score, dtype: float64
# Step-3: fill forward local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift())).ffill()
3 22.0
4 22.0
5 24.0
6 24.0
Name: score, dtype: float64
# Step-4: check if the condition is True
>>> df['score'] - df['score'].mask(df['score'].gt(df['score'].shift())).ffill() > 10
3 False
4 False
5 False
6 False
Name: score, dtype: bool
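The four steps above can be put together for all students at once. The following is a minimal sketch, not the answer's exact code: it swaps `groupby(...).apply(...).droplevel(0)` for `groupby(...).transform(...)`, and the helper name `rise_from_local_min` is made up for illustration. It assumes the dataframe is already sorted by date within each student.

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "student": "A A A B B B B C C".split(),
    "exam_date": [
        datetime.datetime(2013, 4, 1), datetime.datetime(2013, 6, 1),
        datetime.datetime(2013, 7, 1), datetime.datetime(2013, 9, 2),
        datetime.datetime(2013, 10, 1), datetime.datetime(2013, 11, 2),
        datetime.datetime(2014, 2, 2), datetime.datetime(2013, 7, 1),
        datetime.datetime(2013, 9, 2),
    ],
    "score": [15, 17, 32, 22, 28, 24, 33, 33, 15],
})


def rise_from_local_min(scores: pd.Series) -> pd.Series:
    """Hypothetical helper: score minus the most recent local minimum."""
    # Step 1: rows whose score is higher than the previous one are not minima
    increased = scores.gt(scores.shift())
    # Steps 2 and 3: hide those rows, then carry the last minimum forward
    local_min = scores.mask(increased).ffill()
    # Step 4 compares this difference against the threshold
    return scores - local_min


# transform returns a Series aligned to df's index, so no droplevel is needed
rise = df.groupby("student")["score"].transform(rise_from_local_min)
out = df[rise > 10]
print(out)
```

With the sample data this keeps only row 2 (student A, score 32), matching the expected output in the question. For student B the rises are 0, 6, 0 and 9, so nothing passes the threshold.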