Find percent diff and diff with consecutive but odd number of dates
I have a dataset df in which I want to find the difference and the percent difference. I want to start from the earliest date and compare its value with the next date:
id date value
1 11/01/2020 10
2 11/01/2020 5
1 10/01/2020 20
2 10/01/2020 30
1 09/01/2020 15
2 09/01/2020 10
3 11/01/2020 5
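For reference, a minimal sketch that rebuilds this sample as a DataFrame (the name df and keeping date as a string are my assumptions; for this sample the string order already matches the date order):

import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 1, 2, 1, 2, 3],
    "date":  ["11/01/2020", "11/01/2020", "10/01/2020", "10/01/2020",
              "09/01/2020", "09/01/2020", "11/01/2020"],
    "value": [10, 5, 20, 30, 15, 10, 5],
})
# optional: parse to real datetimes; month/day/year format is assumed here
# df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")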
Desired output
id date diff percent
1 10/01/2020 5 33
1 11/01/2020 -10 -50
2 10/01/2020 20 200
2 11/01/2020 -25 -83.33
3 11/01/2020 0 0
I want to go through one group at a time and compare the previous value with the next value, getting the increase as both a difference and a percentage.
For example, for ID 1:
from 09/01/2020 to 10/01/2020: from 15 to 20,
a difference of 5
and a percent difference of 33%;
from 10/01/2020 to 11/01/2020: from 20 to 10,
a difference of -10 and a percent difference of -50%.
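To make the arithmetic explicit (my reading of the example): the difference is the current value minus the previous one, and the percent difference is that difference relative to the previous value:

# worked example for ID 1, 09/01/2020 -> 10/01/2020
prev, curr = 15, 20
diff = curr - prev              # 5
percent = diff / prev * 100     # 33.33..., the 33% in the desired output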
This is what I am doing:
import pandas as pd

a['date'] = pd.to_datetime(a['date'])
grouped = a.sort_values('date').groupby(['id'])
output = pd.DataFrame({
    # each agg reduces a group to a single value taken from its last row,
    # so only one row per id survives
    'date': grouped['date'].agg(lambda x: x.iloc[-1]).values,
    'diff': grouped['value'].agg(lambda x: x.diff().fillna(0).iloc[-1]).values,
    'percentdiff': grouped['value'].agg(lambda x: x.pct_change().fillna(0).iloc[-1] * 100).values,
    'type': grouped['id'].agg(lambda x: x.iloc[0]).values
})
However, I noticed that some values are missing, because this is the output I get:
Is it possible to achieve my desired output?
Do I perhaps need a loop that references the previous date's row and compares it with the next date's row?
Any suggestions are welcome.
Here is one approach, assuming I have understood the logic correctly:
the idea is to use shift within each group
to compute the difference and the percentage.
result = (df.sort_values(["id", "date", "value"])
# use this later to drop the first row per group
# if number is greater than 1, else leave as-is
.assign(counter=lambda x: x.groupby("id").date.transform("size"),
date_shift=lambda x: x.groupby(["id"]).date.shift(1),
value_shift=lambda x: x.groupby("id").value.shift(1),
diff=lambda x: x.value - x.value_shift,
percent=lambda x: x["diff"].div(x.value_shift).mul(100).round(2))
# here is where the counter column becomes useful
# drop rows where date_shift is null and counter is > 1
# this way if number of rows in the group is just one it is kept,
# if greater than one, the first row is dropped,
# as the first row would have nulls due to the `shift` method.
.query("not (date_shift.isna() and counter>1)")
.loc[:, ["id", "date", "diff", "percent"]]
.fillna(0))
result
id date diff percent
2 1 10/01/2020 5.0 33.33
0 1 11/01/2020 -10.0 -50.00
3 2 10/01/2020 20.0 200.00
1 2 11/01/2020 -25.0 -83.33
6 3 11/01/2020 0.0 0.00
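As a cross-check, here is an alternative sketch of the same idea using the built-in per-group diff/pct_change instead of an explicit shift (my own variant; result2 is a hypothetical name). On this sample it should produce the same five rows:

out = df.sort_values(["id", "date", "value"]).copy()
grp = out.groupby("id")["value"]
out = out.assign(
    diff=grp.diff(),                               # value minus previous value within the id group
    percent=grp.pct_change().mul(100).round(2),    # same as diff / previous value * 100
)
# keep single-row groups, but drop the NaN first row of multi-row groups
size = out.groupby("id")["date"].transform("size")
result2 = (out.loc[~(out["diff"].isna() & size.gt(1)),
                   ["id", "date", "diff", "percent"]]
              .fillna(0))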