Create New Column Taking Slope of Values

Question

I have the following DataFrame of students' exam scores on different dates:

import pandas as pd

df = pd.DataFrame({'student': 'A A A B B B B'.split(),
                  'exam_date': pd.date_range(start='1/1/2020', periods=7, freq='M'),
                  'score': [15, 28, 17, 22, 43, 40, 52]})

print(df)

  student  exam_date  score
0       A 2020-01-31     15
1       A 2020-02-29     28
2       A 2020-03-31     17
3       B 2020-04-30     22
4       B 2020-05-31     43
5       B 2020-06-30     40
6       B 2020-07-31     52

I need to create a new column: the slope of the scores for each student.

I am trying the following script:

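import datetime as dt
import scipy.stats

df.exam_date = pd.to_datetime(df.exam_date)

# Convert each date to a day ordinal so it can serve as the numeric x-variable
df['date_ordinal'] = pd.to_datetime(df['exam_date']).map(dt.datetime.toordinal)

# This fits a single line across all rows, not one line per student
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(df['date_ordinal'], df['score'])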

What is the most sensible way to do this? Any suggestions would be appreciated. Thanks!

Expected output:

  student  exam_date  score     slope
0       A 2020-01-31     15     0.028
1       A 2020-02-29     28     0.028
2       A 2020-03-31     17     0.028
3       B 2020-04-30     22     0.285
4       B 2020-05-31     43     0.285
5       B 2020-06-30     40     0.285
6       B 2020-07-31     52     0.285

Answer 1

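You can use the groupby / apply pattern to compute the slope for every student in a single call, then attach the result back to the original DataFrame with merge or join.

Step 1: Define a function that computes the slope for one student. You have basically already done this in your question: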

import pandas as pd
from scipy import stats

def get_slope(df_student):
    """
    Assumes df_student has the columns 'score' and 'date_ordinal' available
    """
    results = stats.linregress(df_student['date_ordinal'], df_student['score'])
    return results.slope
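
As a sanity check on what get_slope returns, the sketch below compares linregress with the closed-form least-squares slope for student A's three exams. The helper name closed_form_slope and the hard-coded dates are assumptions made for illustration; they are not part of the original answer. Because the x-values are day ordinals, the slope is expressed in score points per day.

```python
import datetime as dt

import numpy as np
from scipy import stats

# Student A's exams from the question: dates become day ordinals, scores stay as-is
dates = [dt.date(2020, 1, 31), dt.date(2020, 2, 29), dt.date(2020, 3, 31)]
x = np.array([d.toordinal() for d in dates], dtype=float)
y = np.array([15, 28, 17], dtype=float)


def closed_form_slope(x, y):
    """Least-squares slope: sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / (dx ** 2).sum())


slope_manual = closed_form_slope(x, y)
slope_scipy = stats.linregress(x, y).slope

# Both routes should agree to floating-point precision
assert np.isclose(slope_manual, slope_scipy)
print(slope_manual, slope_scipy)
```

Only the differences between the x-values matter for the slope, so any day-based numeric encoding of the dates should give the same result as toordinal.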

Step 2: Use groupby

Compute the slope for each student:

slopes = df.groupby('student').apply(get_slope).rename('slope')

The slopes object is a Series indexed by student.
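
To make the groupby / apply step less opaque, here is a rough equivalent written as an explicit loop over the groups. It is only a sketch: the data are rebuilt with literal month-end dates instead of pd.date_range, and slopes_loop is a name introduced here for illustration.

```python
import datetime as dt

import pandas as pd
from scipy import stats

# Same data as the question, with the month-end dates written out explicitly
df = pd.DataFrame({
    'student': 'A A A B B B B'.split(),
    'exam_date': pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31',
                                 '2020-04-30', '2020-05-31', '2020-06-30',
                                 '2020-07-31']),
    'score': [15, 28, 17, 22, 43, 40, 52],
})
df['date_ordinal'] = df['exam_date'].map(dt.datetime.toordinal)

# Iterating over a groupby yields (group key, sub-DataFrame) pairs.
# apply effectively runs the function on each sub-DataFrame and collects the results.
slopes_loop = pd.Series(
    {student: stats.linregress(group['date_ordinal'], group['score']).slope
     for student, group in df.groupby('student')},
    name='slope',
)
slopes_loop.index.name = 'student'
print(slopes_loop)
```

The result has one entry per student, which is why it still has to be matched back to the seven original rows in the next step.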

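Step 3: Join back to the original DataFrame

Here we can use either join or merge. The main difference between them is that merge is more flexible and can join on both row indexes and columns, while join is more concise and is designed primarily to join on the index. To use a plain join here, the index of the original DataFrame would need to be changed to student (or the column would have to be passed through join's on parameter). So use the merge method instead: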

df_final = df.merge(slopes, left_on=['student'], right_index=True).drop('date_ordinal', axis=1)

df_final should now match your desired output.
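
For completeness, the three steps can be put together into one script. The sketch below also tries two other ways of attaching the slopes, Series.map and DataFrame.join with its on parameter, and checks that all three agree. The literal dates stand in for the pd.date_range call in the question, and names such as via_map and via_join are illustrative assumptions.

```python
import datetime as dt

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'student': 'A A A B B B B'.split(),
    'exam_date': pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31',
                                 '2020-04-30', '2020-05-31', '2020-06-30',
                                 '2020-07-31']),
    'score': [15, 28, 17, 22, 43, 40, 52],
})
df['date_ordinal'] = df['exam_date'].map(dt.datetime.toordinal)


def get_slope(df_student):
    """Slope of score against date ordinal for one student's rows."""
    return stats.linregress(df_student['date_ordinal'], df_student['score']).slope


# Select only the columns the function needs before calling apply
slopes = (df.groupby('student')[['date_ordinal', 'score']]
            .apply(get_slope)
            .rename('slope'))

# Option 1: merge the student column against the index of slopes (as in the answer)
df_final = (df.merge(slopes, left_on='student', right_index=True)
              .drop(columns='date_ordinal'))

# Option 2: look up each row's student in slopes
via_map = df['student'].map(slopes)

# Option 3: join with on= matches a column of df against the index of slopes
via_join = df.join(slopes, on='student')['slope']

# All three routes should give the same per-row slopes
assert np.allclose(df_final['slope'].to_numpy(), via_map.to_numpy())
assert np.allclose(df_final['slope'].to_numpy(), via_join.to_numpy())
print(df_final.round({'slope': 3}))
```

Which option to prefer is mostly a matter of taste; map is probably the shortest when only one column is being attached.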

Notes:

If you are willing to use student as the index of the original DataFrame, the same result can be achieved more concisely, as follows:

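import datetime as dt
import pandas as pd
from scipy import stats

# Note: newer pandas versions spell the month-end frequency 'ME' instead of 'M'
df = pd.DataFrame({'student': 'A A A B B B B'.split(),
                  'exam_date': pd.date_range(start='1/1/2020', periods=7, freq='M'),
                  'score': [15, 28, 17, 22, 43, 40, 52]})
df['date_ordinal'] = pd.to_datetime(df['exam_date']).map(dt.datetime.toordinal)

df.set_index(['student'], inplace=True, append=False)
# apply returns a Series indexed by student, so the assignment aligns on the index
df['slope'] = df.groupby('student')\
    .apply(lambda x: stats.linregress(x['date_ordinal'], x['score']).slope)
# Drop the helper column once the slopes are attached
df = df.drop('date_ordinal', axis=1)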

python

time-series

scipy

linear-regression

dataframe