Map two dataframes, count events where timestamps in second dataframe are within the date-time ranges of the first dataframe

I have a dataframe df that records people's movements from one location to another. I have a second dataframe df2 that records events at specific times. For each entry in df, I want to count the number of events of each event type whose timestamp falls between start_ts and end_ts for the matching person_id.

import pandas as pd
import numpy as np

current_time = '2022-05-05 17:00'

df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "location_id": [1, 5, 2, 7, 3, 8],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05  12:00"]),
})

df2 = pd.DataFrame({"person_id": ["A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "C"],
                    "timestamp": pd.to_datetime(["2022-05-05 01:00", "2022-05-05 01:10", "2022-05-05 01:30", "2022-05-05 06:00",
                                             "2022-05-05 07:00", "2022-05-05 08:00", "2022-05-05 13:00", "2022-05-05 14:00",
                                             "2022-05-05 15:00", "2022-05-05 01:00", "2022-05-05 01:30", "2022-05-05 02:00",
                                             "2022-05-05 01:00", "2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 11:10",
                                             "2022-05-05 11:20", "2022-05-05 11:30"]),
                    "event": ["1", "2", "3", "1", "2", "2", "3", "1", "1", "2", "3", "3", "1", "1", "3", "2", "3", "1"],
})

df looks like this:

print(df.to_string())
>>
  person_id  location_id            start_ts              end_ts
0         A            1 2022-05-05 00:00:00 2022-05-05 02:00:00
1         B            5 2022-05-05 00:00:00 2022-05-05 03:00:00
2         A            2 2022-05-05 05:00:00 2022-05-05 10:00:00
3         C            7 2022-05-05 00:00:00 2022-05-05 04:00:00
4         A            3 2022-05-05 13:00:00 2022-05-05 16:00:00
5         C            8 2022-05-05 11:00:00 2022-05-05 12:00:00

df2 looks like this:

print(df2.to_string())
>>
   person_id           timestamp event
0          A 2022-05-05 01:00:00     1
1          A 2022-05-05 01:10:00     2
2          A 2022-05-05 01:30:00     3
3          A 2022-05-05 06:00:00     1
4          A 2022-05-05 07:00:00     2
5          A 2022-05-05 08:00:00     2
6          A 2022-05-05 13:00:00     3
7          A 2022-05-05 14:00:00     1
8          A 2022-05-05 15:00:00     1
9          B 2022-05-05 01:00:00     2
10         B 2022-05-05 01:30:00     3
11         B 2022-05-05 02:00:00     3
12         C 2022-05-05 01:00:00     1
13         C 2022-05-05 02:00:00     1
14         C 2022-05-05 03:00:00     3
15         C 2022-05-05 11:10:00     2
16         C 2022-05-05 11:20:00     3
17         C 2022-05-05 11:30:00     1

I tried a pivot-style count as follows:

df2 = df2.sample(frac=1)

for i in df.index:
    for j in np.unique(df2["event"]):
        df.loc[i, "count_event_" + j] = len(df2.loc[(df2["timestamp"].between(df.loc[i, "start_ts"], df.loc[i, "end_ts"], inclusive = "both")) & 
                                                (df2["event"] == j) &
                                                (df2["person_id"] == df.loc[i, "person_id"])])

The output I want is:

print(df.to_string())
>>
  person_id  location_id            start_ts              end_ts  count_event_1  count_event_2  count_event_3
0         A            1 2022-05-05 00:00:00 2022-05-05 02:00:00            1.0            1.0            1.0
1         B            5 2022-05-05 00:00:00 2022-05-05 03:00:00            0.0            1.0            2.0
2         A            2 2022-05-05 05:00:00 2022-05-05 10:00:00            1.0            2.0            0.0
3         C            7 2022-05-05 00:00:00 2022-05-05 04:00:00            2.0            0.0            1.0
4         A            3 2022-05-05 13:00:00 2022-05-05 16:00:00            2.0            0.0            1.0
5         C            8 2022-05-05 11:00:00 2022-05-05 12:00:00            1.0            1.0            1.0

I would like to get rid of the nested for loops and get the same result with a group-by or pivot instead. How can I achieve this efficiently, especially given that each real dataframe has several million rows?

Here is one approach (I have left the intermediate print-outs in to help show what happens at each step):

# Setup (only needed if the timestamp columns are still strings, e.g. after reading from CSV;
# note the format is "%Y-%m-%d %H:%M" -- "%s" is not a valid minutes/seconds directive)
df["start_ts"] = pd.to_datetime(df["start_ts"], format="%Y-%m-%d %H:%M")
df["end_ts"] = pd.to_datetime(df["end_ts"], format="%Y-%m-%d %H:%M")
df2["timestamp"] = pd.to_datetime(df2["timestamp"], format="%Y-%m-%d %H:%M")
# Find matching intervals
for idx in df["person_id"].unique():
    df2.loc[df2["person_id"] == idx, "interval"] = df2.loc[
        df2["person_id"] == idx, "timestamp"
    ].map(
        lambda x: df.loc[
            (df["person_id"] == idx) & (df["start_ts"] <= x) & (x <= df["end_ts"]),
            ["start_ts", "end_ts"],
        ].index[0]
    )
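As a side note: when the intervals for each person never overlap (true for the sample data), the row-wise lambda above can be replaced with pd.IntervalIndex.get_indexer, which looks up all of a person's timestamps in one vectorized call. A minimal sketch on a cut-down version of the sample data (get_indexer raises on overlapping intervals, and this sketch assumes every timestamp falls inside some interval, since unmatched rows come back as -1):

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": ["A", "B", "A"],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00"]),
})
df2 = pd.DataFrame({
    "person_id": ["A", "A", "B"],
    "timestamp": pd.to_datetime(["2022-05-05 01:00", "2022-05-05 06:00", "2022-05-05 02:00"]),
})

for pid, grp in df.groupby("person_id"):
    # One interval per df row for this person, boundaries included
    idx = pd.IntervalIndex.from_arrays(grp["start_ts"], grp["end_ts"], closed="both")
    mask = df2["person_id"] == pid
    # Position of each timestamp within this person's intervals (-1 = no match)
    pos = idx.get_indexer(df2.loc[mask, "timestamp"])
    df2.loc[mask, "interval"] = grp.index.to_numpy()[pos]

print(df2)
```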

print(df2.head())
# Output
  person_id           timestamp event  interval
0         A 2022-05-05 01:00:00     1       0.0
1         A 2022-05-05 01:10:00     2       0.0
2         A 2022-05-05 01:30:00     3       0.0
3         A 2022-05-05 06:00:00     1       2.0
4         A 2022-05-05 07:00:00     2       2.0
# Count number of events
df2 = (
    df2.assign(interval=lambda x: x["interval"].astype(int))
    .assign(count=1)
    .groupby(["person_id", "interval", "event"])
    .agg({"count": sum})
    .reset_index(drop=False)
    .pivot(index=["person_id", "interval"], columns="event")
    .reset_index(drop=False)
)
df2.columns = df2.columns.droplevel()
df2.columns = [
    "person_id",
    "interval",
    "count_event_1",
    "count_event_2",
    "count_event_3",
]

print(df2)
# Output
  person_id  interval  count_event_1  count_event_2  count_event_3
0         A         0            1.0            1.0            1.0
1         A         2            1.0            2.0            NaN
2         A         4            2.0            NaN            1.0
3         B         1            NaN            1.0            2.0
4         C         3            2.0            NaN            1.0
5         C         5            1.0            1.0            1.0
# Final dataframe
df = df.reset_index(drop=False).rename(columns={"index": "interval"})
df = (
    pd.merge(left=df, right=df2, on=["person_id", "interval"])
    .fillna(0)
    .astype(int, errors="ignore")
)

print(df)
# Output as expected
   interval person_id  location_id            start_ts              end_ts  count_event_1  count_event_2  count_event_3
0         0         A            1 2022-05-05 00:00:00 2022-05-05 02:00:00              1              1              1
1         1         B            5 2022-05-05 00:00:00 2022-05-05 03:00:00              0              1              2
2         2         A            2 2022-05-05 05:00:00 2022-05-05 10:00:00              1              2              0
3         3         C            7 2022-05-05 00:00:00 2022-05-05 04:00:00              2              0              1
4         4         A            3 2022-05-05 13:00:00 2022-05-05 16:00:00              2              0              1
5         5         C            8 2022-05-05 11:00:00 2022-05-05 12:00:00              1              1              1
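For frames with millions of rows, a fully vectorized alternative to both the question's nested loop and the per-person map above is to merge on person_id, filter to rows whose timestamp lies inside the interval, and cross-tabulate. This is a sketch, assuming the intermediate person-level merge fits in memory:

```python
import pandas as pd

df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "location_id": [1, 5, 2, 7, 3, 8],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00",
                                               "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00",
                                             "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"])})
df2 = pd.DataFrame({"person_id": ["A", "A", "A", "A", "A", "A", "A", "A", "A",
                                  "B", "B", "B", "C", "C", "C", "C", "C", "C"],
                    "timestamp": pd.to_datetime(["2022-05-05 01:00", "2022-05-05 01:10", "2022-05-05 01:30",
                                                 "2022-05-05 06:00", "2022-05-05 07:00", "2022-05-05 08:00",
                                                 "2022-05-05 13:00", "2022-05-05 14:00", "2022-05-05 15:00",
                                                 "2022-05-05 01:00", "2022-05-05 01:30", "2022-05-05 02:00",
                                                 "2022-05-05 01:00", "2022-05-05 02:00", "2022-05-05 03:00",
                                                 "2022-05-05 11:10", "2022-05-05 11:20", "2022-05-05 11:30"]),
                    "event": ["1", "2", "3", "1", "2", "2", "3", "1", "1",
                              "2", "3", "3", "1", "1", "3", "2", "3", "1"]})

# One row per (interval, event) candidate pair for the same person
m = df.reset_index().merge(df2, on="person_id")
# Keep only events that fall inside the interval (boundaries included)
m = m[m["timestamp"].between(m["start_ts"], m["end_ts"])]
# Count events per original df row and event type
counts = (pd.crosstab(m["index"], m["event"])
            .add_prefix("count_event_")
            .rename_axis(index=None, columns=None))
out = df.join(counts).fillna(0)
print(out)
```

The merge can blow up when one person has many intervals and many events; if memory becomes a concern, the merge-and-filter step can be run in chunks of person_id values, or the data can be pre-sorted and matched with pd.merge_asof before filtering on end_ts.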