Map two dataframes, count events where timestamps in second dataframe are within the date-time ranges of the first dataframe
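I have a dataframe df that records people's movements from one location to another. I have a second dataframe df2 that records timestamped events. For each entry in df, I want to count the number of events of each event type whose timestamp falls between start_ts and end_ts, provided the person_id matches.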
import pandas as pd
import numpy as np
current_time = '2022-05-05 17:00'
df = pd.DataFrame({"person_id": ["A", "B", "A", "C", "A", "C"],
                   "location_id": [1, 5, 2, 7, 3, 8],
                   "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00", "2022-05-05 13:00", "2022-05-05 11:00"]),
                   "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 10:00", "2022-05-05 04:00", "2022-05-05 16:00", "2022-05-05 12:00"]),
                   })
df2 = pd.DataFrame({"person_id": ["A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "C"],
                    "timestamp": pd.to_datetime(["2022-05-05 01:00", "2022-05-05 01:10", "2022-05-05 01:30", "2022-05-05 06:00",
                                                 "2022-05-05 07:00", "2022-05-05 08:00", "2022-05-05 13:00", "2022-05-05 14:00",
                                                 "2022-05-05 15:00", "2022-05-05 01:00", "2022-05-05 01:30", "2022-05-05 02:00",
                                                 "2022-05-05 01:00", "2022-05-05 02:00", "2022-05-05 03:00", "2022-05-05 11:10",
                                                 "2022-05-05 11:20", "2022-05-05 11:30"]),
                    "event": ["1", "2", "3", "1", "2", "2", "3", "1", "1", "2", "3", "3", "1", "1", "3", "2", "3", "1"],
                    })
df looks as follows:
print(df.to_string())
>>
person_id location_id start_ts end_ts
0 A 1 2022-05-05 00:00:00 2022-05-05 02:00:00
1 B 5 2022-05-05 00:00:00 2022-05-05 03:00:00
2 A 2 2022-05-05 05:00:00 2022-05-05 10:00:00
3 C 7 2022-05-05 00:00:00 2022-05-05 04:00:00
4 A 3 2022-05-05 13:00:00 2022-05-05 16:00:00
5 C 8 2022-05-05 11:00:00 2022-05-05 12:00:00
df2 looks as follows:
print(df2.to_string())
>>
person_id timestamp event
0 A 2022-05-05 01:00:00 1
1 A 2022-05-05 01:10:00 2
2 A 2022-05-05 01:30:00 3
3 A 2022-05-05 06:00:00 1
4 A 2022-05-05 07:00:00 2
5 A 2022-05-05 08:00:00 2
6 A 2022-05-05 13:00:00 3
7 A 2022-05-05 14:00:00 1
8 A 2022-05-05 15:00:00 1
9 B 2022-05-05 01:00:00 2
10 B 2022-05-05 01:30:00 3
11 B 2022-05-05 02:00:00 3
12 C 2022-05-05 01:00:00 1
13 C 2022-05-05 02:00:00 1
14 C 2022-05-05 03:00:00 3
15 C 2022-05-05 11:10:00 2
16 C 2022-05-05 11:20:00 3
17 C 2022-05-05 11:30:00 1
I tried a pivot-style count as follows:
df2 = df2.sample(frac=1)
for i in df.index:
    for j in np.unique(df2["event"]):
        df.loc[i, "count_event_" + j] = len(df2.loc[(df2["timestamp"].between(df.loc[i, "start_ts"], df.loc[i, "end_ts"], inclusive = "both")) &
                                                    (df2["event"] == j) &
                                                    (df2["person_id"] == df.loc[i, "person_id"])])
My desired output is:
print(df.to_string())
>>
person_id location_id start_ts end_ts count_event_1 count_event_2 count_event_3
0 A 1 2022-05-05 00:00:00 2022-05-05 02:00:00 1.0 1.0 1.0
1 B 5 2022-05-05 00:00:00 2022-05-05 03:00:00 0.0 1.0 2.0
2 A 2 2022-05-05 05:00:00 2022-05-05 10:00:00 1.0 2.0 0.0
3 C 7 2022-05-05 00:00:00 2022-05-05 04:00:00 2.0 0.0 1.0
4 A 3 2022-05-05 13:00:00 2022-05-05 16:00:00 2.0 0.0 1.0
5 C 8 2022-05-05 11:00:00 2022-05-05 12:00:00 1.0 1.0 1.0
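I want to get rid of the nested for loops and get the same result with groupby or pivot instead. How can I do this efficiently, given that each real dataframe has several million entries?
Here is one way to do it (I left in the intermediate print-outs to help show what happens at each step):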
# Setup
df["start_ts"] = pd.to_datetime(df["start_ts"], format="%Y-%m-%d %H:%M:%S")
df["end_ts"] = pd.to_datetime(df["end_ts"], format="%Y-%m-%d %H:%M:%S")
df2["timestamp"] = pd.to_datetime(df2["timestamp"], format="%Y-%m-%d %H:%M:%S")
# Find matching intervals
for idx in df["person_id"].unique():
    df2.loc[df2["person_id"] == idx, "interval"] = df2.loc[
        df2["person_id"] == idx, "timestamp"
    ].map(
        lambda x: df.loc[
            (df["person_id"] == idx) & (df["start_ts"] <= x) & (x <= df["end_ts"]),
            ["start_ts", "end_ts"],
        ].index[0]
    )
print(df2.head())
# Output
person_id timestamp event interval
0 A 2022-05-05 01:00:00 1 0.0
1 A 2022-05-05 01:10:00 2 0.0
2 A 2022-05-05 01:30:00 3 0.0
3 A 2022-05-05 06:00:00 1 2.0
4 A 2022-05-05 07:00:00 2 2.0
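The lookup above evaluates a boolean mask over df once per event row, which may become slow at the scale mentioned in the question. As a rough sketch of an alternative, and assuming that intervals belonging to the same person never overlap, pd.merge_asof can attach to each event the latest interval that started at or before it. The frames intervals and events below are small hypothetical stand-ins for df and df2, not the data from the question.

```python
import pandas as pd

# Hypothetical stand-ins for df and df2 (assumption: a person's intervals do not overlap).
intervals = pd.DataFrame({
    "person_id": ["A", "A", "B"],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 05:00", "2022-05-05 00:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 10:00", "2022-05-05 03:00"]),
})
events = pd.DataFrame({
    "person_id": ["A", "A", "A", "B"],
    "timestamp": pd.to_datetime(["2022-05-05 01:00", "2022-05-05 06:00",
                                 "2022-05-05 11:00", "2022-05-05 02:00"]),
    "event": ["1", "2", "3", "3"],
})

# merge_asof needs both frames sorted by the column they are matched on.
left = events.sort_values("timestamp")
# Expose the row label of each interval as an "interval" column before sorting.
right = intervals.rename_axis("interval").reset_index().sort_values("start_ts")

# For every event, take the same person's most recent interval with start_ts <= timestamp.
matched = pd.merge_asof(
    left,
    right,
    left_on="timestamp",
    right_on="start_ts",
    by="person_id",
    direction="backward",
)

# An event that happens after that interval has ended belongs to no interval.
matched = matched[matched["timestamp"] <= matched["end_ts"]]
print(matched[["person_id", "timestamp", "event", "interval"]])
```

Here the event at 11:00 for person A is dropped because the only interval that started before it ended at 10:00. If intervals of one person can overlap, this sketch would silently pick only the most recently started one, so it is not a drop-in replacement in that case.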
# Count number of events
df2 = (
    df2.assign(interval=lambda x: x["interval"].astype(int))
    .assign(count=1)
    .groupby(["person_id", "interval", "event"])
    .agg({"count": "sum"})
    .reset_index(drop=False)
    .pivot(index=["person_id", "interval"], columns="event")
    .reset_index(drop=False)
)
df2.columns = df2.columns.droplevel()
df2.columns = [
    "person_id",
    "interval",
    "count_event_1",
    "count_event_2",
    "count_event_3",
]
print(df2)
# Output
person_id interval count_event_1 count_event_2 count_event_3
0 A 0 1.0 1.0 1.0
1 A 2 1.0 2.0 NaN
2 A 4 2.0 NaN 1.0
3 B 1 NaN 1.0 2.0
4 C 3 2.0 NaN 1.0
5 C 5 1.0 1.0 1.0
# Final dataframe
df = df.reset_index(drop=False).rename(columns={"index": "interval"})
df = (
    pd.merge(left=df, right=df2, on=["person_id", "interval"])
    .fillna(0)
    .astype(int, errors="ignore")
)
print(df)
# Output as expected
interval person_id location_id start_ts end_ts count_event_1 count_event_2 count_event_3
0 0 A 1 2022-05-05 00:00:00 2022-05-05 02:00:00 1 1 1
1 1 B 5 2022-05-05 00:00:00 2022-05-05 03:00:00 0 1 2
2 2 A 2 2022-05-05 05:00:00 2022-05-05 10:00:00 1 2 0
3 3 C 7 2022-05-05 00:00:00 2022-05-05 04:00:00 2 0 1
4 4 A 3 2022-05-05 13:00:00 2022-05-05 16:00:00 2 0 1
5 5 C 8 2022-05-05 11:00:00 2022-05-05 12:00:00 1 1 1
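For comparison, here is a minimal loop-free sketch of the same counting idea: join the two frames on person_id, keep only the pairs where the timestamp lies inside the interval, and tabulate with pd.crosstab. The frames intervals and events are small hypothetical stand-ins for df and df2. Note that the join builds every interval/event pair per person before filtering, so with millions of rows it could need a lot of memory; treat it as an illustration rather than a tuned solution.

```python
import pandas as pd

# Hypothetical stand-ins for df and df2; person C has an interval but no events.
intervals = pd.DataFrame({
    "person_id": ["A", "A", "B", "C"],
    "start_ts": pd.to_datetime(["2022-05-05 00:00", "2022-05-05 05:00",
                                "2022-05-05 00:00", "2022-05-05 00:00"]),
    "end_ts": pd.to_datetime(["2022-05-05 02:00", "2022-05-05 10:00",
                              "2022-05-05 03:00", "2022-05-05 01:00"]),
})
events = pd.DataFrame({
    "person_id": ["A", "A", "A", "A", "B"],
    "timestamp": pd.to_datetime(["2022-05-05 01:00", "2022-05-05 01:30", "2022-05-05 06:00",
                                 "2022-05-05 11:00", "2022-05-05 02:00"]),
    "event": ["1", "1", "2", "3", "3"],
})

# Pair every interval with every event of the same person, remembering the interval's row label.
candidates = intervals.rename_axis("interval").reset_index().merge(events, on="person_id")

# Keep only events whose timestamp falls inside the interval (both ends inclusive).
inside = candidates[
    candidates["timestamp"].between(candidates["start_ts"], candidates["end_ts"], inclusive="both")
]

# One row per interval, one column per event type, values are counts.
counts = pd.crosstab(inside["interval"], inside["event"]).add_prefix("count_event_")

# Intervals without any matching event get explicit zeros.
counts = counts.reindex(intervals.index, fill_value=0)

result = intervals.join(counts)
print(result.to_string())
```

Reindexing with fill_value=0 keeps the counts as integers and retains intervals that have no events at all, which an inner merge would drop.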