How to get the latest row at specific time intervals to form a dataframe?
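Suppose 3 children are competing to see who can sell the most candy, chocolate bars, and cookies over several days. They start the competition at 08:15:00 (8:15 a.m.) that day and agree to enter their sales into a tracker, as shown in the dataframe below: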
import numpy as np
import pandas as pd

df = pd.DataFrame({
'Name': ['Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Khala', 'Harvey', 'Gaddy'],
'Timestamp': ['2022-01-01 08:17:23.12', '2022-01-01 08:22:58.76', '2022-01-01 08:19:02.57', '2022-01-01 08:55:43.99','2022-01-01 08:41:23.10', '2022-01-01 09:14:59.99', '2022-01-01 09:15:02.02', '2022-01-01 09:44:43.30','2022-01-01 09:54:23.71', '2022-01-01 10:15:00.00', '2022-01-01 10:15:02.99', '2022-01-01 10:19:43.52'],
'Candy': [2, 1, 3, 3, 5, 4, 6, 6, 4, 10, 9, 14],
'Chocolate Bars': [4, np.nan, 6, 7, 8, 6, 7, 13, 10, 19, 11, 11],
'Cookies': [1, 1, 4, 2, 4, 5, 5, 8, 11, 8, 15, 17]
})
Name Timestamp Candy Chocolate Bars Cookies
0 Harvey 2022-01-01 08:17:23.12 2 4 1
1 Khala 2022-01-01 08:22:58.76 1 NaN 1
2 Gaddy 2022-01-01 08:19:02.57 3 6 4
3 Harvey 2022-01-01 08:55:43.99 3 7 2
4 Khala 2022-01-01 08:41:23.10 5 8 4
5 Gaddy 2022-01-01 09:14:59.99 4 6 5
6 Harvey 2022-01-01 09:15:02.02 6 7 5
7 Khala 2022-01-01 09:44:43.30 6 13 8
8 Gaddy 2022-01-01 09:54:23.71 4 10 11
9 Khala 2022-01-01 10:15:00.00 10 19 8
10 Harvey 2022-01-01 10:15:02.99 9 11 15
11 Gaddy 2022-01-01 10:19:43.52 14 11 17
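The goal now is to create a new dataframe that captures each child's latest sale within 1-hour intervals (an example of a one-hour window is 08:15:00.00 - 09:14:59.99), along with the window in which it was captured. The dataframe would then look like this: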
Name Window Timestamp Candy Chocolate Bars Cookies
1 Harvey 09:15:00.00 2022-01-01 08:55:43.99 3 7 2
2 Khala 09:15:00.00 2022-01-01 08:41:23.10 5 8 4
3 Gaddy 09:15:00.00 2022-01-01 09:14:59.99 4 6 5
4 Harvey 10:15:00.00 2022-01-01 09:15:02.02 6 7 5
5 Khala 10:15:00.00 2022-01-01 09:44:43.30 6 13 8
6 Gaddy 10:15:00.00 2022-01-01 09:54:23.71 4 10 11
7 Khala 11:15:00.00 2022-01-01 10:15:00.00 10 19 8
8 Harvey 11:15:00.00 2022-01-01 10:15:02.99 9 11 15
9 Gaddy 11:15:00.00 2022-01-01 10:19:43.52 14 11 17
The first thing I would do is convert the Timestamp column to datetime, which makes it easier to work with:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Name': ['Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Khala', 'Harvey', 'Gaddy'],
'Timestamp': ['2022-01-01 08:17:23.12', '2022-01-01 08:22:58.76', '2022-01-01 08:19:02.57', '2022-01-01 08:55:43.99','2022-01-01 08:41:23.10', '2022-01-01 09:14:59.99', '2022-01-01 09:15:02.02', '2022-01-01 09:44:43.30','2022-01-01 09:54:23.71', '2022-01-01 10:15:00.00', '2022-01-01 10:15:02.99', '2022-01-01 10:19:43.52'],
'Candy': [2, 1, 3, 3, 5, 4, 6, 6, 4, 10, 9, 14],
'Chocolate Bars': [4, np.nan, 6, 7, 8, 6, 7, 13, 10, 19, 11, 11],
'Cookies': [1, 1, 4, 2, 4, 5, 5, 8, 11, 8, 15, 17]
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
The next step is to add the Window column:
# Get window
window_start = pd.to_timedelta("15min")
df["Window"] = (df["Timestamp"] - window_start).dt.floor("1h") + window_start
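This works by first shifting the timestamps back by 15 minutes, flooring them to the hour, and then adding the 15 minutes back. If you do not want to keep the date in the window, that is possible too.
The last step is to sort by timestamp and keep only one row per window and person, namely the one with the latest timestamp: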
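To make the shift, floor, shift-back idea concrete, here is a minimal sketch on three timestamps taken from the data, two just before and one just after the 09:15 boundary. Note that the computation above yields the start of each window, whereas the desired output in the question labels windows by their end; adding one hour and formatting with `strftime` to drop the date are my assumptions about how to reconcile the two, not part of the original answer.

```python
import pandas as pd

# Three sample timestamps around the 09:15 boundary (taken from the data above).
ts = pd.Series(pd.to_datetime([
    "2022-01-01 08:17:23.12",
    "2022-01-01 09:14:59.99",
    "2022-01-01 09:15:02.02",
]))

offset = pd.to_timedelta("15min")

# Shift back 15 minutes, floor to the hour, shift forward again:
# this gives the start of the quarter-past window each timestamp falls in.
window_start = (ts - offset).dt.floor("1h") + offset

# Assumption: label each window by its end, as in the question's desired output.
window_end = window_start + pd.to_timedelta("1h")

# Keep only the time of day if the date is not wanted in the label.
window_label = window_end.dt.strftime("%H:%M:%S")
print(window_label.tolist())  # ['09:15:00', '09:15:00', '10:15:00']
```

The first two timestamps land in the same window even though they are almost an hour apart, while the third, only two seconds later than the second, starts a new one.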
# Keep only one row per window and person
df = df.sort_values("Timestamp", ascending=False).groupby(["Name", "Window"]).head(1)
df = df.sort_index().reset_index(drop=True)
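As an alternative to sorting and taking `head(1)`, the latest row per group can probably also be picked with `idxmax` on the Timestamp column. The sketch below uses a reduced copy of the data (five rows, one value column) purely for illustration; the variable names `latest_idx` and `latest` are my own.

```python
import pandas as pd

# Reduced sample of the tracker data, for illustration only.
df = pd.DataFrame({
    "Name": ["Harvey", "Harvey", "Khala", "Khala", "Harvey"],
    "Timestamp": pd.to_datetime([
        "2022-01-01 08:17:23.12",
        "2022-01-01 08:55:43.99",
        "2022-01-01 08:22:58.76",
        "2022-01-01 08:41:23.10",
        "2022-01-01 09:15:02.02",
    ]),
    "Candy": [2, 3, 1, 5, 6],
})

# Same window computation as above: start of the quarter-past window.
offset = pd.to_timedelta("15min")
df["Window"] = (df["Timestamp"] - offset).dt.floor("1h") + offset

# Index label of the latest Timestamp within each (Name, Window) group.
latest_idx = df.groupby(["Name", "Window"])["Timestamp"].idxmax()

# Select those rows and restore the original row order.
latest = df.loc[latest_idx].sort_index().reset_index(drop=True)
print(latest)
```

Both variants keep whole rows, so a missing value in the latest row stays missing rather than being filled from an earlier entry.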
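After converting the Timestamp column to datetime, you can use the DataFrame .groupby method combined with the .resample method:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
cols = ['Candy', 'Chocolate Bars', 'Cookies']
# 60-minute bins starting at 15 minutes past the hour, labelled by their right edge
# ("60min"/"15min" replace the "60T"/"15T" aliases, which newer pandas versions deprecate)
(df
.groupby("Name")
.resample("60min", offset="15min", on="Timestamp", label="right")
.last()
.loc[:, cols]
.reset_index()
.sort_values("Timestamp")
)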
Name Timestamp Candy Chocolate Bars Cookies
0 Gaddy 2022-01-01 09:15:00 4 6.0 5
3 Harvey 2022-01-01 09:15:00 3 7.0 2
6 Khala 2022-01-01 09:15:00 5 8.0 4
1 Gaddy 2022-01-01 10:15:00 4 10.0 11
4 Harvey 2022-01-01 10:15:00 6 7.0 5
7 Khala 2022-01-01 10:15:00 6 13.0 8
2 Gaddy 2022-01-01 11:15:00 14 11.0 17
5 Harvey 2022-01-01 11:15:00 9 11.0 15
8 Khala 2022-01-01 11:15:00 10 19.0 8
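In the output above, the Timestamp column holds the window label, so the time of the original entry is no longer visible. One possible way to keep both, sketched under my own assumptions, is to group with `pd.Grouper` and carry a copy of the timestamp through the aggregation. The helper column `Latest` and the reduced sample data are hypothetical, introduced only for this illustration.

```python
import pandas as pd

# Reduced sample of the tracker data, for illustration only.
df = pd.DataFrame({
    "Name": ["Harvey", "Harvey", "Khala", "Harvey"],
    "Timestamp": pd.to_datetime([
        "2022-01-01 08:17:23.12",
        "2022-01-01 08:55:43.99",
        "2022-01-01 10:15:00.00",
        "2022-01-01 09:15:02.02",
    ]),
    "Candy": [2, 3, 10, 6],
})

# 60-minute bins starting at 15 past the hour, labelled by their right edge.
bins = pd.Grouper(key="Timestamp", freq="60min", offset="15min", label="right")

out = (
    df.sort_values("Timestamp")
      # Hypothetical helper column: a copy of the entry time that survives the grouping.
      .assign(Latest=lambda d: d["Timestamp"])
      .groupby(["Name", bins])
      .last()
      .reset_index()
      # The grouped key is the window label; the copied column is the entry time.
      .rename(columns={"Timestamp": "Window", "Latest": "Timestamp"})
)
print(out)
```

One caveat that applies to this sketch and to the resample version: `.last()` returns the last non-null value per column, so if the latest row in a window had a missing value, an earlier value from the same window would be reported instead.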