Pandas: date index of the latest non-null value in a grouped date rolling window
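I am trying to get, per group, the latest date within a rolling time window on which a value is non-null. It works fine without grouping, but grouping seems to mess everything up.
Here is a reproducible example: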
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame({})
df["date"] = [dt(2020, 10, i+1) for i in range(10)]
df["group"] = ["a" if int(i/3) == (i/3) else "b" for i in range(10)]
df["value"] = [i if int(i/2) == (i/2) else np.nan for i in range(10)]
The DataFrame:
        date group  value
0 2020-10-01     a    0.0
1 2020-10-02     b    NaN
2 2020-10-03     b    2.0
3 2020-10-04     a    NaN
4 2020-10-05     b    4.0
5 2020-10-06     b    NaN
6 2020-10-07     a    6.0
7 2020-10-08     b    NaN
8 2020-10-09     b    8.0
9 2020-10-10     a    NaN
Target output:
        date group  value     output
0 2020-10-01     a    0.0 2020-10-01
1 2020-10-02     b    NaN        NaT
2 2020-10-03     b    2.0 2020-10-03
3 2020-10-04     a    NaN 2020-10-01
4 2020-10-05     b    4.0 2020-10-05
5 2020-10-06     b    NaN 2020-10-05
6 2020-10-07     a    6.0 2020-10-07
7 2020-10-08     b    NaN 2020-10-05
8 2020-10-09     b    8.0 2020-10-09
9 2020-10-10     a    NaN 2020-10-07
My attempt:
df = df.set_index("date").sort_index(ascending=True)
def latest_non_null_value_index(x):
    y = x[np.isnan(x) == False]
    print(y.index)
    if len(y) > 0:
        return y.index[-1]
    else:
        return np.nan
latest_index = df\
    .groupby(["group"])\
    .rolling("35D")\
    ["value"]\
    .apply(lambda x: latest_non_null_value_index(x).timestamp())\
    .reset_index()
def to_datetime_from_timestamp(x):
    if pd.isnull(x) == False:
        return dt.fromtimestamp(x)
    else:
        return pd.NaT
latest_index["value"] = latest_index["value"]\
    .apply(to_datetime_from_timestamp)
What I get:
  group       date               value
0     a 2020-10-01 2020-10-01 02:00:00
1     a 2020-10-04 2020-10-01 02:00:00
2     a 2020-10-07 2020-10-03 02:00:00
3     a 2020-10-10 2020-10-03 02:00:00
4     b 2020-10-02                 NaT
5     b 2020-10-03 2020-10-06 02:00:00
6     b 2020-10-05 2020-10-07 02:00:00
7     b 2020-10-06 2020-10-07 02:00:00
8     b 2020-10-08 2020-10-07 02:00:00
9     b 2020-10-09 2020-10-10 02:00:00
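Any idea what I am missing?
EDIT: I also don't seem to have this problem when fetching the latest value... so it really is tied to the index.
EDIT2: If I could somehow apply the function to 2 columns, I could pass the date as a second column and get a workaround.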
You can forward-fill the missing values within each group (fillna with "ffill"):
import pandas as pd
from datetime import datetime as dt
import numpy as np
df = pd.DataFrame({})
df["date"] = [dt(2020, 10, i+1) for i in range(10)]
df["group"] = ["a" if int(i/3) == (i/3) else "b" for i in range(10)]
df["value"] = [i if int(i/2) == (i/2) else np.nan for i in range(10)]
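df = df.sort_values("date")  # Just make sure that the rows are properly ordered
date = df["date"].copy()
date[df.value.isna()] = pd.NaT  # keep a date only where the value is present
latest_index = date.groupby(df.group).ffill()  # same effect as fillna(method="ffill"), which newer pandas versions deprecate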
This doesn't handle your rolling time window, but you can drop the values that fall outside the window like this:
latest_index[(df.date - latest_index).dt.days > 35] = pd.NaT
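The forward-fill idea and the window cut-off can be wrapped in one helper and checked against the target output from the question. This is only a sketch: the function name latest_valid_date_ffill and its window_days parameter are made up for illustration, and it assumes the columns are named date, group and value as in the example. It uses >= rather than > for the cut-off on the assumption that the goal is to mirror rolling("35D"), whose default window excludes its left edge.

```python
import numpy as np
import pandas as pd


def latest_valid_date_ffill(df, window_days=35):
    """Per group, the most recent date with a non-null value, or NaT.

    Hypothetical helper: assumes columns named "date", "group", "value".
    """
    df = df.sort_values("date")
    # Keep the date only on rows where the value is present.
    valid_date = df["date"].where(df["value"].notna())
    # Carry the last valid date forward inside each group.
    latest = valid_date.groupby(df["group"]).ffill()
    # Blank out dates that are window_days or more behind the current row.
    too_old = (df["date"] - latest) >= pd.Timedelta(days=window_days)
    return latest.mask(too_old)


# Same data as in the question, built a little more directly.
df = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=10, freq="D"),
    "group": ["a" if i % 3 == 0 else "b" for i in range(10)],
    "value": [float(i) if i % 2 == 0 else np.nan for i in range(10)],
})
df["output"] = latest_valid_date_ffill(df)
print(df)
```

Because the result keeps the original row index, it can be assigned straight back as a column without any re-alignment.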
That isn't very neat, though, so you could instead try a max aggregation over the rolling window, like this:
df = df.set_index("date", drop=False)
df = df.sort_index()
# it wasn't letting me aggregate dates, so we have to convert to float and then back to dates
date = pd.to_numeric(df["date"]).astype("float64")
date[df.value.isna()] = np.nan  # rows without a value must not contribute to the max
latest_index = date.groupby(df.group).rolling("35D").max()
latest_index = pd.to_datetime(latest_index)
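The rolling result is indexed by (group, date) rather than by the original rows, so it still has to be joined back to get the target layout. Below is a self-contained sketch of the rolling-max idea that does this with a merge. It converts dates to float seconds since the epoch instead of using pd.to_numeric, which is a substitution made here to keep the unit explicit. The names indexed, secs, rolled, out and merged are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=10, freq="D"),
    "group": ["a" if i % 3 == 0 else "b" for i in range(10)],
    "value": [float(i) if i % 2 == 0 else np.nan for i in range(10)],
})

indexed = df.set_index("date").sort_index()

# Dates as float seconds since the epoch, NaN where the value is missing,
# so that rows without a value never win the max.
epoch = pd.Timestamp("1970-01-01")
secs = (indexed.index.to_series() - epoch) / pd.Timedelta(seconds=1)
secs = secs.where(indexed["value"].notna())

# Per group, the largest (latest) valid date inside each 35-day window.
rolled = secs.groupby(indexed["group"]).rolling("35D").max()

# Back to datetimes; NaN becomes NaT.
out = pd.to_datetime(rolled, unit="s").rename("output").reset_index()

# Join on (group, date) to recover the original row order.
merged = df.merge(out, on=["group", "date"], how="left")
print(merged)
```

Merging on both group and date, rather than on date alone, keeps this working when the same date appears in more than one group.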