Python Group by minutes in a day
I have more than 30 days of log data. I want to group the data so I can see the total number of events in each 15-minute window of a 24-hour day. The data format looks like this:
2021-04-26 19:12:03, upload
2021-04-26 11:32:03, download
2021-04-24 19:14:03, download
2021-04-22 1:9:03, download
2021-04-19 4:12:03, upload
2021-04-07 7:12:03, download
I am looking for a result like:
19:15:00, 2
11:55:00, 1
7:15:00, 1
4:15:00, 1
1:15:00, 1
Currently, I use a Grouper:
df['date'] = pd.to_datetime(df['date'])
df.groupby(pd.Grouper(key='date', freq='15min')).Host.count()
My result looks like:
date
2021-04-08 16:15:00+00:00 1
2021-04-08 16:30:00+00:00 20
2021-04-08 16:45:00+00:00 6
2021-04-08 17:00:00+00:00 6
2021-04-08 17:15:00+00:00 0
..
2021-04-29 18:00:00+00:00 3
2021-04-29 18:15:00+00:00 9
2021-04-29 18:30:00+00:00 0
2021-04-29 18:45:00+00:00 3
2021-04-29 19:00:00+00:00 15
Is there any way I can group again by time of day only, without including the date?
Suppose you want to aggregate into 5-minute windows. For that, you need to extract the timestamp column. Let df be your pandas DataFrame. For each time in the timestamp column, round it up to the nearest multiple of 5 minutes and add it to a counter map. See the code below.
import collections
import math

timestamp = df["timestamp"]
counter = collections.defaultdict(int)

def get_time(time):
    hh, mm, ss = map(int, time.split(':'))
    total_seconds = hh * 3600 + mm * 60 + ss
    # Round up to the next multiple of 5 minutes ('5 min' means '5*60 sec');
    # try the formula on paper for better understanding
    roundup_seconds = math.ceil(total_seconds / (5 * 60)) * (5 * 60)
    new_hh = roundup_seconds // 3600
    roundup_seconds %= 3600
    new_mm = roundup_seconds // 60
    new_ss = roundup_seconds % 60
    return f"{new_hh}:{new_mm}:{new_ss}"  # f-strings require Python 3.6+

for time in timestamp:
    counter[get_time(time)] += 1
# counter now holds the event counts per rounded timestamp
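The round-up arithmetic can be checked in isolation with a small self-contained sketch (the function name `round_up_to_5min` and the sample times are hypothetical; it assumes the same `HH:MM:SS` string format as above):

```python
import math

def round_up_to_5min(time):
    """Round an HH:MM:SS string up to the next 5-minute boundary."""
    hh, mm, ss = map(int, time.split(':'))
    total_seconds = hh * 3600 + mm * 60 + ss
    rounded = math.ceil(total_seconds / 300) * 300  # 300 s = 5 min
    return f"{rounded // 3600:02d}:{rounded % 3600 // 60:02d}:{rounded % 60:02d}"

print(round_up_to_5min("19:12:03"))  # 19:15:00
print(round_up_to_5min("11:32:03"))  # 11:35:00
```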
One approach is to use a Timedelta instead of a Datetime, since the comparison only involves hours and minutes, not dates.
import pandas as pd
import numpy as np
df = pd.DataFrame({'time': {0: '2021-04-26 19:12:03', 1: '2021-04-26 11:32:03',
2: '2021-04-24 19:14:03', 3: '2021-04-22 1:9:03',
4: '2021-04-19 4:12:03', 5: '2021-04-07 7:12:03'},
'event': {0: 'upload', 1: 'download', 2: 'download',
3: 'download', 4: 'upload', 5: 'download'}})
# Convert To TimeDelta (Ignore Day)
df['time'] = pd.to_timedelta(df['time'].str[-8:])
# Set TimeDelta as index
df = df.set_index('time')
# Get count of events per 15-minute period
df = df.resample('15min')['event'].count()
# Convert To Nearest 15 Minute Interval
ns15min = 15 * 60 * 1000000000 # 15 minutes in nanoseconds
df.index = pd.to_timedelta(((df.index.astype(np.int64) // ns15min + 1) * ns15min))
# Reset Index, Filter and Sort
df = df.reset_index()
df = df[df['event'] > 0]
df = df.sort_values(['event', 'time'], ascending=(False, False))
# Remove Day Part of Time Delta (Convert to str)
df['time'] = df['time'].astype(str).str[-8:]
# For Display
print(df.to_string(index=False))
Filtered output:
time event
19:15:00 2
21:00:00 1
11:30:00 1
07:15:00 1
04:15:00 1
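The nanosecond round-up step above can be verified on its own, with a couple of hypothetical sample times (a minimal sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

ns15min = 15 * 60 * 1000000000  # 15 minutes in nanoseconds
idx = pd.to_timedelta(['19:00:00', '11:30:00'])
# Integer-divide by the bucket size, add one, multiply back:
# this maps each time to the *next* 15-minute boundary
rounded = pd.to_timedelta((idx.astype(np.int64) // ns15min + 1) * ns15min)
print(rounded)  # 19:15:00 and 11:45:00
```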
Do you want something like this?
The idea here is: if you don't care about the date, you can replace every date with one arbitrary fixed date, and then group/count the data based on the time alone.
df.Host = 1
df.date = df.date.str.replace(r'(\d{4}-\d{1,2}-\d{1,2})', '2021-04-26', regex=True)
df.date = pd.to_datetime(df.date)
new_df = df.groupby(pd.Grouper(key='date', freq='15min')).agg({'Host': 'sum'}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time
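Put together as a runnable sketch on the question's sample data (the DataFrame below is constructed here for illustration; the column names `date`/`Host` follow the question, and the fixed date `2021-04-26` is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2021-04-26 19:12:03', '2021-04-26 11:32:03',
                            '2021-04-24 19:14:03', '2021-04-22 1:9:03',
                            '2021-04-19 4:12:03', '2021-04-07 7:12:03']})
df['Host'] = 1
# Collapse every row onto one arbitrary day so only the time of day matters
df['date'] = df['date'].str.replace(r'\d{4}-\d{1,2}-\d{1,2}', '2021-04-26', regex=True)
df['date'] = pd.to_datetime(df['date'])
new_df = df.groupby(pd.Grouper(key='date', freq='15min')).agg({'Host': 'sum'}).reset_index()
new_df = new_df.loc[new_df['Host'] != 0]
new_df['date'] = new_df['date'].dt.time
print(new_df)
```

Note that `pd.Grouper` labels each bin by its start, so the two 19:12/19:14 events land in the `19:00:00` bucket with a count of 2.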