Filter Dataframe Based on Local Minima with Increasing Timeline

Question

Edited:

I have the following dataframe of students with their exam scores on different dates (sorted):

import datetime

import pandas as pd

df = pd.DataFrame({'student': 'A A A B B B B C C'.split(),
                   'exam_date': [datetime.datetime(2013,4,1), datetime.datetime(2013,6,1),
                                 datetime.datetime(2013,7,1), datetime.datetime(2013,9,2),
                                 datetime.datetime(2013,10,1), datetime.datetime(2013,11,2),
                                 datetime.datetime(2014,2,2), datetime.datetime(2013,7,1),
                                 datetime.datetime(2013,9,2)],
                   'score': [15, 17, 32, 22, 28, 24, 33, 33, 15]})

print(df)

  student  exam_date  score
0       A 2013-04-01     15
1       A 2013-06-01     17
2       A 2013-07-01     32
3       B 2013-09-02     22
4       B 2013-10-01     28
5       B 2013-11-02     24
6       B 2014-02-02     33
7       C 2013-07-01     33
8       C 2013-09-02     15

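I need to keep only those rows where the score has increased by more than 10 from the local minima.

For example, for student A the local minima is 15, and the score then rises to 32 on a later date, so we keep that row.

For student B, no score increases by more than 10 from a local minima: 28-22 and 33-24 are both less than 10.

For student C, the local minima is 15, but the score does not increase after that, so we drop it.

I'm trying the following script: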

out = df[df['score'] - df.groupby('student', as_index=False)['score'].cummin()['score']>= 10]

print(out)
2   A   2013-07-01  32
6   B   2014-02-02  33 #--Shouldn't capture this as it's increased by `9` from local minima of `24`
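
To see why row 6 slips through, it can help to look at the values `cummin` actually produces. The sketch below rebuilds only the `student` and `score` columns from the dataframe above; the `running_min` and `gain` column names are illustrative assumptions, not part of the original script.

```python
import pandas as pd

df = pd.DataFrame({'student': 'A A A B B B B C C'.split(),
                   'score': [15, 17, 32, 22, 28, 24, 33, 33, 15]})

# cummin keeps the lowest score seen so far for each student. It never
# resets when the score dips again, so it is a running minimum rather
# than a local one.
df['running_min'] = df.groupby('student')['score'].cummin()
df['gain'] = df['score'] - df['running_min']

print(df[df['student'] == 'B'])
```

For student B the running minimum stays at 22 on every row, so 33 is compared with 22 (a gain of 11) instead of with the later dip to 24 (a gain of 9).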

Expected output:

   student  exam_date  score
2        A  2013-07-01  32

# For A, score of 32 is increased by 17 from local minima of 15

What would be the smartest way of doing this? Any suggestions would be appreciated. Thanks!

Answer 1

Combining what you posted with @Corralien's solution, I came up with a one-liner that works nicely:

filtered = df.groupby('student', as_index=False).apply(lambda x: None if (v := (x['score'].cummax() * (x['score'] > x['score'].shift()) - (x['score'].cummin()) >= 10)).sum() == 0 else x.loc[v.idxmax()] ).dropna()

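Output (the scores 27.0 and 43.0 do not appear in the dataframe shown in the question, so this listing appears to come from an earlier version of the data):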

>>> filtered
  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0

Answer 2

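We can try the following:

Use groupby + diff to find the difference between consecutive scores for each student.
Use where to assign NaN to all rows where the score difference is not greater than 10.
Use groupby + first to get, for each student, the first row where the score difference is greater than 10.

# Difference between consecutive scores within each student
diff = df.groupby('student')['score'].diff()
msk = (diff>10) | (diff.groupby([diff[::-1].shift().lt(0).cumsum()[::-1], df['student']]).cumsum()>10)
out = df.where(msk).groupby('student').first().reset_index()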

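Output (the scores 27.0 and 43.0 do not appear in the dataframe shown in the question, so this listing appears to come from an earlier version of the data):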

  student  exam_date  score
0       A 2013-06-01   27.0
1       B 2013-10-01   43.0
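
As a rough check of the steps above, the sketch below runs them on the dataframe as it currently appears in the question and names the intermediate pieces. The names `run_id` and `run_gain` are assumptions introduced here for readability; the logic is otherwise the same groupby + diff, where, and groupby + first sequence.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A B B B B C C'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02', '2013-10-01',
        '2013-11-02', '2014-02-02', '2013-07-01', '2013-09-02']),
    'score': [15, 17, 32, 22, 28, 24, 33, 33, 15],
})

# Step 1: difference between consecutive scores within each student
diff = df.groupby('student')['score'].diff()

# Label that changes at every row where the score dropped, so that
# gains can be accumulated over the rows sharing a label.
run_id = diff[::-1].shift().lt(0).cumsum()[::-1]
run_gain = diff.groupby([run_id, df['student']]).cumsum()

# Step 2: rows failing the mask become NaN (including their student value)
msk = (diff > 10) | (run_gain > 10)

# Step 3: first surviving row per student
out = df.where(msk).groupby('student').first().reset_index()
print(out)
```

On this data only student A's 2013-07-01 row survives. Because `where` inserts NaN, the `score` column comes back as a float.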

Answer 3

Assuming your dataframe is already sorted by date:

highest_score = lambda x: x['score'] - x['score'].mask(x['score'].gt(x['score'].shift())).ffill() > 10
out = df[df.groupby('student').apply(highest_score).droplevel(0)]
print(out)

# Output
  student  exam_date  score
2       A 2013-07-01     32

Focus on the lambda function

Let's modify your dataframe and extract one student to avoid the groupby:

>>> df = df[df['student'] == 'B']
>>> df
  student  exam_date  score
3       B 2013-09-02     22
4       B 2013-10-01     28
5       B 2013-11-02     24
6       B 2014-02-02     33

# Step-1: find row where value is not a local minima
>>> df['score'].gt(df['score'].shift())
3    False
4     True
5    False
6     True
Name: score, dtype: bool

# Step-2: hide non local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift()))
3    22.0
4     NaN
5    24.0
6     NaN
Name: score, dtype: float64

# Step-3: fill forward local minima values
>>> df['score'].mask(df['score'].gt(df['score'].shift())).ffill()
3    22.0
4    22.0
5    24.0
6    24.0
Name: score, dtype: float64

# Step-4: check if the condition is True
>>> df['score'] - df['score'].mask(df['score'].gt(df['score'].shift())).ffill() > 10
3    False
4    False
5    False
6    False
Name: score, dtype: bool
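
Putting the four steps together, one possible variant (a sketch, not the exact code above) applies the same mask-and-forward-fill idea per student through `groupby(...).transform`. The result then keeps the original row index, so no `droplevel` is needed. The helper name `gain_from_local_min` is hypothetical, and the data is assumed to be sorted by date within each student.

```python
import pandas as pd

df = pd.DataFrame({
    'student': 'A A A B B B B C C'.split(),
    'exam_date': pd.to_datetime([
        '2013-04-01', '2013-06-01', '2013-07-01', '2013-09-02', '2013-10-01',
        '2013-11-02', '2014-02-02', '2013-07-01', '2013-09-02']),
    'score': [15, 17, 32, 22, 28, 24, 33, 33, 15],
})

def gain_from_local_min(s: pd.Series) -> pd.Series:
    """Return each score minus the most recent local minimum before it."""
    # Steps 1-2: hide every value that is higher than its predecessor
    # Step 3: carry the remaining (local minimum) values forward
    local_min = s.mask(s.gt(s.shift())).ffill()
    return s - local_min

# Step 4: keep rows whose gain over the local minimum exceeds 10
gain = df.groupby('student')['score'].transform(gain_from_local_min)
out = df[gain > 10]
print(out)
```

With the question's data this keeps only row 2 (student A, score 32), which matches the expected output.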
