Calculating mean and diff conditional on date values
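I have the following dataframe, where a given assignment work_id is carried out by a student s_id on a date work_date and receives a score score. For each student, the rows are sorted by date (oldest first in the sample below).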
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['work_id', 's_id', 'score', 'work_date'],
...                   data=[['a3', 'p01', np.nan, '2020-05-01'],
...                         ['a2', 'p01', 10, '2020-06-10'],
...                         ['a1', 'p01', 5, '2020-06-15'],
...                         ['a5', 'p02', 5, '2019-10-10'],
...                         ['a7', 'p02', 11, '2020-03-01'],
...                         ['a6', 'p02', np.nan, '2020-04-01'],
...                         ['a4', 'p02', 4, '2020-06-20'],
...                         ])
>>> df
  work_id s_id  score   work_date
0      a3  p01    NaN  2020-05-01
1      a2  p01   10.0  2020-06-10
2      a1  p01    5.0  2020-06-15
3      a5  p02    5.0  2019-10-10
4      a7  p02   11.0  2020-03-01
5      a6  p02    NaN  2020-04-01
6      a4  p02    4.0  2020-06-20
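I want to add two columns, mean_score and diff_score. The mean_score column should show each student's running average: the mean of the current score and all scores obtained on earlier assignments. The diff_score column should contain the difference between the current score and the previous non-NaN score. The final dataframe must therefore look like this: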
  work_id s_id  score   work_date  mean_score  diff_score
0      a3  p01    NaN  2020-05-01         NaN         NaN
1      a2  p01   10.0  2020-06-10   10.000000         NaN
2      a1  p01    5.0  2020-06-15    7.500000        -5.0
3      a5  p02    5.0  2019-10-10    5.000000         NaN
4      a7  p02   11.0  2020-03-01    8.000000         6.0
5      a6  p02    NaN  2020-04-01         NaN         NaN
6      a4  p02    4.0  2020-06-20    6.666667        -7.0
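To make the two definitions concrete, the arithmetic for student p02 can be worked through in plain Python. This is only a hand-rolled illustration of the expected numbers; `running_stats` is a hypothetical helper, not a pandas function, and `None` stands in for NaN.

```python
import math

# Scores for student p02 in date order; None marks a missing score.
scores = [5, 11, None, 4]


def running_stats(values):
    """Return (means, diffs), leaving NaN wherever the score is missing."""
    means, diffs = [], []
    seen = []  # non-missing scores encountered so far
    for v in values:
        if v is None:
            means.append(math.nan)
            diffs.append(math.nan)
            continue
        # Difference to the previous non-missing score, if there is one.
        diffs.append(v - seen[-1] if seen else math.nan)
        seen.append(v)
        # Mean of the current score and all earlier non-missing scores.
        means.append(sum(seen) / len(seen))
    return means, diffs


means, diffs = running_stats(scores)
print(means)  # [5.0, 8.0, nan, 6.666666666666667]
print(diffs)  # [nan, 6, nan, -7]
```

These values match rows 3 to 6 of the expected table above.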
I can get this result by defining the following two functions (which handle possible NaN entries) and calling them through apply/lambda:
def calculate_mean(workid):
    # Look up the date and student of this assignment.
    date = df[df.work_id == workid].work_date.iloc[0]
    sid = df[df.work_id == workid].s_id.iloc[0]
    if df[(df.work_id == workid) & (df.s_id == sid) & (df.work_date == date)].score.notnull().item():
        # Mean of all of this student's scores up to and including this date.
        mean = df[(df.s_id == sid) & (df.work_date <= date)].score.mean()
    else:
        mean = np.nan
    return mean

def calculate_diff(workid):
    date = df[df.work_id == workid].work_date.iloc[0]
    sid = df[df.work_id == workid].s_id.iloc[0]
    try:
        if df[(df.s_id == sid) & (df.work_date == date)].score.notnull().item():
            # Difference between this score and the previous non-NaN score.
            delta = df[(df.s_id == sid) & (df.work_date <= date) & (df.score.notnull())].score.diff().iloc[-1]
        else:
            delta = np.nan
    except Exception:
        # Any lookup failure is treated as "no difference available".
        delta = np.nan
    return delta

df['mean_score'] = df['work_id'].apply(lambda x: calculate_mean(x))
df['diff_score'] = df['work_id'].apply(lambda x: calculate_diff(x))
I need a more efficient approach (possibly using groupby), because this one is very slow on large dataframes.
IIUC, use pandas.DataFrame.groupby with expanding.mean and diff:
g = df.groupby("s_id")["score"]
s1 = g.apply(lambda x: x.dropna().expanding().mean())
s2 = g.apply(lambda x: x.dropna().diff())
df["mean_score"] = s1.reset_index(level=0, drop=True)
df["diff_score"] = s2.reset_index(level=0, drop=True)
print(df)
Or wrap it in a function:
def mean_and_diff(series):
    s = series.dropna()
    d = {"mean_score": s.expanding().mean(), "diff_score": s.diff()}
    return pd.DataFrame(d)
tmp = df.groupby("s_id")["score"].apply(mean_and_diff).reset_index(level=0, drop=True)
df[["mean_score", "diff_score"]] = tmp[["mean_score", "diff_score"]]
Output:
  work_id s_id  score   work_date  mean_score  diff_score
0      a3  p01    NaN  2020-05-01         NaN         NaN
1      a2  p01   10.0  2020-06-10   10.000000         NaN
2      a1  p01    5.0  2020-06-15    7.500000        -5.0
3      a5  p02    5.0  2019-10-10    5.000000         NaN
4      a7  p02   11.0  2020-03-01    8.000000         6.0
5      a6  p02    NaN  2020-04-01         NaN         NaN
6      a4  p02    4.0  2020-06-20    6.666667        -7.0
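A variant that avoids the Python-level lambda may be worth trying on large frames: drop the NaN rows once, then use the grouped `expanding().mean()` and `diff()` directly. This is a sketch that has not been benchmarked here. The names `ordered`, `valid` and `out` are illustrative, and the explicit sort is an added assumption for input that is not already in date order per student.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=["work_id", "s_id", "score", "work_date"],
    data=[
        ["a3", "p01", np.nan, "2020-05-01"],
        ["a2", "p01", 10, "2020-06-10"],
        ["a1", "p01", 5, "2020-06-15"],
        ["a5", "p02", 5, "2019-10-10"],
        ["a7", "p02", 11, "2020-03-01"],
        ["a6", "p02", np.nan, "2020-04-01"],
        ["a4", "p02", 4, "2020-06-20"],
    ],
)

# Assumption: enforce chronological order within each student.
# ISO "YYYY-MM-DD" strings sort correctly as plain text.
ordered = df.sort_values(["s_id", "work_date"])

# Remove missing scores once, so they do not enter the mean or the diff.
valid = ordered.dropna(subset=["score"])
g = valid.groupby("s_id")["score"]

# The grouped expanding mean comes back indexed by (s_id, original index);
# dropping the first level restores the original row labels.
mean_score = g.expanding().mean().reset_index(level=0, drop=True)
diff_score = g.diff()

# Assignment aligns on the index, so rows with NaN scores stay NaN.
out = df.copy()
out["mean_score"] = mean_score
out["diff_score"] = diff_score
print(out)
```

Because the results are assigned back by index label, `out` keeps the original row order even if the sort rearranged rows internally.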