如何在特定时间间隔获取最新行以形成数据框?

How to get the latest row at specific time intervals to form a dataframe?

假设 3 个孩子正在比赛,看谁能在几天内卖出最多的糖果、巧克力棒和饼干。他们当天在 08:15:00(上午 8 点 15 分)开始比赛,并同意将他们的销售输入跟踪器,如下面的数据框所示:

df = pd.DataFrame({
'Name': ['Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Khala', 'Harvey', 'Gaddy'],
'Timestamp': ['2022-01-01 08:17:23.12', '2022-01-01 08:22:58.76', '2022-01-01 08:19:02.57', '2022-01-01 08:55:43.99','2022-01-01 08:41:23.10', '2022-01-01 09:14:59.99', '2022-01-01 09:15:02.02', '2022-01-01 09:44:43.30','2022-01-01 09:54:23.71', '2022-01-01 10:15:00.00', '2022-01-01 10:15:02.99', '2022-01-01 10:19:43.52'],
'Candy': [2, 1, 3, 3, 5, 4, 6, 6, 4, 10, 9, 14],
'Chocolate Bars': [4, np.nan, 6, 7, 8, 6, 7, 13, 10, 19, 11, 11],
'Cookies': [1, 1, 4, 2, 4, 5, 5, 8, 11, 8, 15, 17]
})

    Name    Timestamp   Candy   Chocolate Bars  Cookies
0   Harvey  2022-01-01 08:17:23.12  2   4   1
1   Khala   2022-01-01 08:22:58.76  1   NaN 1
2   Gaddy   2022-01-01 08:19:02.57  3   6   4
3   Harvey  2022-01-01 08:55:43.99  3   7   2
4   Khala   2022-01-01 08:41:23.10  5   8   4
5   Gaddy   2022-01-01 09:14:59.99  4   6   5
6   Harvey  2022-01-01 09:15:02.02  6   7   5
7   Khala   2022-01-01 09:44:43.30  6   13  8
8   Gaddy   2022-01-01 09:54:23.71  4   10  11
9   Khala   2022-01-01 10:15:00.00  10  19  8
10  Harvey  2022-01-01 10:15:02.99  9   11  15
11  Gaddy   2022-01-01 10:19:43.52  14  11  17

现在的目的是创建一个新的数据框,以 1 小时为间隔捕获每个 child 的最新销售(一个小时 window 的示例是 [=16= .00 - 09:14:59.99) 和它们被捕获的 window。这样数据框将如下所示:

    Name    Window  Timestamp   Candy   Chocolate Bars  Cookies
1   Harvey  09:15:00.00 2022-01-01 08:55:43.99  3   7   2
2   Khala   09:15:00.00 2022-01-01 08:41:23.10  5   8   4
3   Gaddy   09:15:00.00 2022-01-01 09:14:59.99  4   6   5
4   Harvey  10:15:00.00 2022-01-01 09:15:02.02  6   7   5
5   Khala   10:15:00.00 2022-01-01 09:44:43.30  6   13  8
6   Gaddy   10:15:00.00 2022-01-01 09:54:23.71  4   10  11
7   Khala   11:15:00.00 2022-01-01 10:15:00.00  10  19  8
8   Harvey  11:15:00.00 2022-01-01 10:15:02.99  9   11  15
9   Gaddy   11:15:00.00 2022-01-01 10:19:43.52  14  11  17

我要做的第一件事是将时间戳列转换为日期时间,使其更易于使用

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Harvey', 'Khala', 'Gaddy', 'Khala', 'Harvey', 'Gaddy'],
    'Timestamp': ['2022-01-01 08:17:23.12', '2022-01-01 08:22:58.76', '2022-01-01 08:19:02.57', '2022-01-01 08:55:43.99','2022-01-01 08:41:23.10', '2022-01-01 09:14:59.99', '2022-01-01 09:15:02.02', '2022-01-01 09:44:43.30','2022-01-01 09:54:23.71', '2022-01-01 10:15:00.00', '2022-01-01 10:15:02.99', '2022-01-01 10:19:43.52'],
    'Candy': [2, 1, 3, 3, 5, 4, 6, 6, 4, 10, 9, 14],
    'Chocolate Bars': [4, np.nan, 6, 7, 8, 6, 7, 13, 10, 19, 11, 11],
    'Cookies': [1, 1, 4, 2, 4, 5, 5, 8, 11, 8, 15, 17]
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"]) 

然后下一步是添加 window 列

# Get window
window_start = pd.to_timedelta("15min")
df["Window"] = (df["Timestamp"] - window_start).dt.floor("1h") + window_start

您可以通过先将时间偏移 15 分钟来实现,只需花费几个小时,然后再将 15 分钟加回去。如果您不想将日期保留在 window 中,那也是可能的。

最后一步是对时间戳进行排序,每个 window 和每个人

只保留一个时间戳
# Keep only one row per window and person
df = df.sort_values("Timestamp", ascending=False).groupby(["Name", "Window"]).head(1)
df = df.sort_index().reset_index(drop=True)

Timestamp 列转换为日期时间后,您可以使用 DataFrame .groupby method combined with .resample 方法:

df["Timestamp"] = pd.to_datetime(df["Timestamp")
cols = ['Candy', 'Chocolate Bars', 'Cookies']

(df
.groupby("Name")
.resample("60T", offset="15T", on="Timestamp", label="right")
.last()
.loc[:, cols]
.reset_index()
.sort_values("Timestamp")
)

Name    Timestamp   Candy   Chocolate Bars  Cookies
0   Gaddy   2022-01-01 09:15:00 4   6.0      5
3   Harvey  2022-01-01 09:15:00 3   7.0      2
6   Khala   2022-01-01 09:15:00 5   8.0      4
1   Gaddy   2022-01-01 10:15:00 4   10.0    11
4   Harvey  2022-01-01 10:15:00 6   7.0      5
7   Khala   2022-01-01 10:15:00 6   13.0     8
2   Gaddy   2022-01-01 11:15:00 14  11.0    17
5   Harvey  2022-01-01 11:15:00 9   11.0    15
8   Khala   2022-01-01 11:15:00 10  19.0     8