Pandas, datetime & logic: Summing specific rows of a column
I have a DataFrame:
date start end inter
0 01-09-2020 10:01:24 10:02:13 0 days 00:00:49
1 01-09-2020 10:04:21 10:22:01 0 days 00:17:40
2 01-09-2020 10:16:14 10:25:06 0 days 00:08:52
3 01-09-2020 10:28:38 10:28:40 0 days 00:00:02
4 01-09-2020 10:37:38 10:37:41 0 days 00:00:03
... ... ... ... ...
995 17-09-2020 12:19:03 12:21:06 0 days 00:02:03
996 17-09-2020 12:22:53 12:22:58 0 days 00:00:05
997 17-09-2020 12:25:11 12:25:14 0 days 00:00:03
998 17-09-2020 12:27:07 12:27:08 0 days 00:00:01
999 17-09-2020 12:29:03 12:29:05 0 days 00:00:02
1000 rows × 4 columns
I want to create a new df in which 'inter' is summed over specific datetime ranges. For example:
new_date start_range end_range inter_sum
0 01-09-2020 10:00:00 10:59:59 0 days 01:15:36
1 01-09-2020 11:00:00 11:59:59 0 days 00:58:30
...
997 17-09-2020 10:00:00 10:59:59 0 days 03:00:15
998 17-09-2020 11:00:00 11:59:59 0 days 00:47:20
where 'inter_sum' is the sum of the 'inter' values that fall between 'start_range' and 'end_range', based on the 'start' and 'end' columns of the previous df.
Try resample:
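import pandas as pd

# convert the date column to datetime if needed; the dates are day-first (DD-MM-YYYY),
# and without dayfirst=True 01-09-2020 is read as 9 January, which is what the sample output below shows
df["date"] = pd.to_datetime(df["date"], dayfirst=True)
# combine date and start time into a single datetime for resample
df["start"] = pd.to_datetime(df["date"].astype(str) + "T" + df["start"].astype(str))
# resample on the start datetime every hour and sum; resample returns a Series,
# so reset_index turns it back into a DataFrame with an "inter_sum" column
output = df.resample("h", on="start")["inter"].sum().reset_index(name="inter_sum")
# formatting to match the expected output
output["date"] = output["start"].dt.date
output["end"] = (output["start"] + pd.Timedelta(minutes=59, seconds=59)).dt.time
output["start"] = output["start"].dt.time
output = output[["date", "start", "end", "inter_sum"]]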
>>> output
date start end inter_sum
0 2020-01-09 10:00:00 10:59:59 0 days 00:27:26
1 2020-01-09 11:00:00 11:59:59 0 days 00:00:00
2 2020-01-09 12:00:00 12:59:59 0 days 00:00:00
3 2020-01-09 13:00:00 13:59:59 0 days 00:00:00
4 2020-01-09 14:00:00 14:59:59 0 days 00:00:00
... ... ... ...
6046 2020-09-17 08:00:00 08:59:59 0 days 00:00:00
6047 2020-09-17 09:00:00 09:59:59 0 days 00:00:00
6048 2020-09-17 10:00:00 10:59:59 0 days 00:00:00
6049 2020-09-17 11:00:00 11:59:59 0 days 00:00:00
6050 2020-09-17 12:00:00 12:59:59 0 days 00:02:14
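For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same resample idea. The four sample rows and the names `fmt`, `start_dt` and `hourly` are made up for illustration, and the final filter that drops empty hours is an optional extra rather than part of the approach above; it assumes you do not want the long run of `0 days 00:00:00` rows that resample produces for hours with no data.

```python
import pandas as pd

# Hypothetical sample rows in the same shape as the question's DataFrame.
df = pd.DataFrame({
    "date": ["01-09-2020", "01-09-2020", "01-09-2020", "17-09-2020"],
    "start": ["10:01:24", "10:04:21", "11:15:00", "12:29:03"],
    "end": ["10:02:13", "10:22:01", "11:15:30", "12:29:05"],
})

# Build full datetimes; the explicit format makes the day-first order unambiguous.
fmt = "%d-%m-%Y %H:%M:%S"
df["start_dt"] = pd.to_datetime(df["date"] + " " + df["start"], format=fmt)
end_dt = pd.to_datetime(df["date"] + " " + df["end"], format=fmt)
df["inter"] = end_dt - df["start_dt"]

# Sum 'inter' per hour, binning each row by the hour its start time falls in.
hourly = (
    df.resample("h", on="start_dt")["inter"]
    .sum()
    .reset_index(name="inter_sum")
)

# Optional: keep only hours that actually contain data.
hourly = hourly[hourly["inter_sum"] > pd.Timedelta(0)].reset_index(drop=True)

# Split the bin start into the columns of the expected output.
hourly["new_date"] = hourly["start_dt"].dt.date
hourly["start_range"] = hourly["start_dt"].dt.time
hourly["end_range"] = (hourly["start_dt"] + pd.Timedelta(minutes=59, seconds=59)).dt.time
hourly = hourly[["new_date", "start_range", "end_range", "inter_sum"]]
print(hourly)
```

Note that each row is assigned entirely to the hour containing its start time, so an interval that runs past the end of the hour is not split between the two bins.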