Python Group by minutes in a day

I have more than 30 days of log data. I want to group the data to see, over a 24-hour day, which 15-minute windows have the fewest total events. The data looks like this:

2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download

I'm looking for a result like:

19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1

Currently, I'm using a Grouper:

df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key="date",freq='.25H')).Host.count()

My result looks like:

date
2021-04-08 16:15:00+00:00     1
2021-04-08 16:30:00+00:00    20
2021-04-08 16:45:00+00:00     6
2021-04-08 17:00:00+00:00     6
2021-04-08 17:15:00+00:00     0
                             ..
2021-04-29 18:00:00+00:00     3
2021-04-29 18:15:00+00:00     9
2021-04-29 18:30:00+00:00     0
2021-04-29 18:45:00+00:00     3
2021-04-29 19:00:00+00:00    15

Is there any way I can then group again by time only, without including the date?
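
One possible way to do that directly (a minimal sketch, assuming df['date'] has already been parsed with pd.to_datetime and that labelling each window by its end, as in the expected output above, is what is wanted):

import pandas as pd

# round each timestamp up to the next 15-minute mark, keep only the time of day
tod = df['date'].dt.ceil('15min').dt.time
counts = df.groupby(tod).size()          # events per 15-minute time-of-day bucket
print(counts.sort_values(ascending=False))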

Suppose you want to aggregate over 5-minute windows. To do that, you need to extract the timestamp column. Let df be your pandas DataFrame. For each time in the timestamps, round that time up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.

import collections
import math

timestamp = df["timestamp"]   # column holding the raw time strings
counter = collections.defaultdict(int)

def get_time(time):
    hh, mm, ss = map(int, time.split(':'))
    total_seconds = hh * 3600 + mm * 60 + ss
    roundup_seconds = math.ceil(total_seconds / (5*60)) * (5*60) 
    # I suggest you to try out the above formula on paper for better understanding
    # '5 min' means '5*60 sec' roundup
    new_hh = roundup_seconds // 3600
    roundup_seconds %= 3600
    new_mm = roundup_seconds // 60
    roundup_seconds %= 60
    new_ss = roundup_seconds
    return f"{new_hh}:{new_mm}:{new_ss}"  # f-strings for python 3.6 and above

for time in timestamp:
    # keep only the HH:MM:SS part in case the column holds full timestamps
    counter[get_time(str(time).split()[-1])] += 1

# Now counter holds the counts per rounded time stamp.
# I've tested locally and it's the same as the output you mentioned.
# Let me know if you need any further help :)
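
For example, fed with the time parts of the question's sample rows (a hypothetical usage sketch, not part of the original answer; 5-minute windows as above):

sample_counts = collections.defaultdict(int)
for t in ["19:12:03", "11:32:03", "19:14:03", "1:9:03", "4:12:03", "7:12:03"]:
    sample_counts[get_time(t)] += 1
print(dict(sample_counts))
# {'19:15:0': 2, '11:35:0': 1, '1:10:0': 1, '4:15:0': 1, '7:15:0': 1}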

One approach is to use a TimeDelta instead of a DateTime, since the comparison then only involves hours and minutes, not dates.

import pandas as pd
import numpy as np

df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
                            2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
                            4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
                   'event': {0: 'upload', 1: 'download', 2: 'download',
                             3: 'download', 4: 'upload', 5: 'download'}})

# Convert To TimeDelta (Ignore Day)
df['time'] = pd.to_timedelta(df['time'].str[-8:])

# Set TimeDelta as index
df = df.set_index('time')
# Get Count of events per 15 minute period
df = df.resample('.25H')['event'].count()

# Convert To Nearest 15 Minute Interval
ns15min = 15 * 60 * 1000000000  # 15 minutes in nanoseconds
df.index = pd.to_timedelta(((df.index.astype(np.int64) // ns15min + 1) * ns15min))

# Reset Index, Filter and Sort
df = df.reset_index()
df = df[df['event'] > 0]
df = df.sort_values(['event', 'time'], ascending=(False, False))
# Remove Day Part of Time Delta (Convert to str)
df['time'] = df['time'].astype(str).str[-8:]

# For Display
print(df.to_string(index=False))

Filtered output:

    time  event
19:15:00      2
21:00:00      1
11:30:00      1
07:15:00      1
04:15:00      1

Is this something like what you want?
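
As a side note (my own check, not part of the answer above): the nanosecond arithmetic in the "Convert To Nearest 15 Minute Interval" step simply pushes each Timedelta to the next 15-minute boundary, e.g.

import pandas as pd

ns15min = 15 * 60 * 1000000000                            # 15 minutes in nanoseconds
t = pd.Timedelta('11:32:03')
print(pd.Timedelta((t.value // ns15min + 1) * ns15min))   # 0 days 11:45:00
# note: values already sitting exactly on a boundary are also moved to the next one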

The idea here is: if you don't care about the date, you can replace every date with the same arbitrary date, and then group/count the data based only on the time part.

df.Host = 1  # one row = one event, so counting is just summing this column
# overwrite every date with the same arbitrary day so only the time of day differs
df.date = df.date.str.replace(r'(\d{4}-\d{1,2}-\d{1,2})', '2021-04-26', regex=True)
df.date = pd.to_datetime(df.date)
# count events per 15-minute window and drop empty windows
new_df = df.groupby(pd.Grouper(key='date', freq='.25H')).agg({'Host': sum}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time  # keep only the time of day for display
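
A variant of the same idea (my own sketch, assuming df still has the date column from the question): instead of rewriting the date with a regex, subtract each timestamp's midnight so only the time of day remains, then bin that Timedelta into 15-minute windows.

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
tod = df['date'] - df['date'].dt.normalize()   # time of day as a Timedelta
bins = tod.dt.floor('15min')                   # start of each 15-minute window
counts = df.groupby(bins).size()
print(counts[counts > 0])

Note that floor labels each window by its start, whereas the answers above label it by its end.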