Calculate usage time per day and customer in Pandas
I have a Pandas DataFrame containing the events that customers send each month, like this:
import pandas as pd

df = pd.DataFrame(
    [
        ('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
        ('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH')
    ],
    columns=['TimeStamp', 'Session ID', 'Event', 'Customer']
)
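Session IDs are unique, but a session can span multiple days. In addition, several sessions can occur on the same day.
I would like to calculate, for each customer, the minutes of usage on each day of the month, like this: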
Customer | 01.01 | 02.01 | ... | 31.01 |
---|---|---|---|---|
A. Inc | 720 | 30 | ... | 50 |
B. GmbH | 1 | 0 | ... | 0 |
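I suspect that splitting the timestamp into date and time, then using groupby('Customer', 'Day', 'Session ID') and applying some math (via apply()) is the way to go, but so far I have not been able to make any real progress.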
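You can try this.
Extract the date and the time of day (in minutes) into new columns. Then use groupby and agg to sum the minutes per customer and date. Finally, pivot the DataFrame.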
# parse the timestamps
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
df['date'] = df['TimeStamp'].dt.date
# time of day of each event, in minutes since midnight
df['minutes'] = df['TimeStamp'].dt.hour * 60 + df['TimeStamp'].dt.minute
# sum the minutes per customer and date, then pivot the dates into columns
new_df = df.groupby(['Customer', 'date']).agg({'minutes': 'sum'}).reset_index()
print(pd.pivot_table(new_df, values='minutes', index=['Customer'], columns='date'))
Output:
date      2017-01-01  2017-01-02
Customer
A. Inc         720.0       750.0
B. GmbH        900.0         NaN
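The pivot above only has columns for dates that occur in the data, and missing combinations show up as NaN. Below is a minimal sketch of how such a result could be expanded to one column per day of the month, filled with zeros and labelled as day.month. It assumes the reporting month is January 2017 and that the `date` column is kept as datetime64 (for example via `dt.normalize()`) rather than `dt.date`, so that it lines up with `pd.date_range`. The names `all_days` and `full_month` are illustrative only.

```python
import pandas as pd

# Per-customer, per-date minutes, copied from the output shown above
new_df = pd.DataFrame({
    'Customer': ['A. Inc', 'A. Inc', 'B. GmbH'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-01']),
    'minutes': [720, 750, 900],
})

pivot = new_df.pivot_table(values='minutes', index='Customer',
                           columns='date', aggfunc='sum')

# Assumption: the month of interest is January 2017
all_days = pd.date_range('2017-01-01', '2017-01-31', freq='D')

# Add a column for every day of the month and replace missing values with 0
full_month = pivot.reindex(columns=all_days).fillna(0).astype(int)

# Label the columns as day.month, as in the desired layout
full_month.columns = full_month.columns.strftime('%d.%m')
print(full_month)
```

Reindexing against a complete date range keeps the column order chronological and makes days without any events explicit.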
OK, I found a solution. It may not be the best one, but it works.
import datetime
import dateutil.relativedelta

# group by id and add max and min values of each group to new columns
group_Session = df.groupby(['Session ID'])
df['Start Time'] = group_Session['TimeStamp'].transform('min')
df['Stop Time'] = group_Session['TimeStamp'].transform('max')
df.drop_duplicates(subset=['Session ID'], keep='first', inplace=True)
# now we have start/stop for each session
# add all days of month to dataframe and fill with zeros
dateStart = datetime.datetime(2022, 2, 1)
dateStop = dateStart + dateutil.relativedelta.relativedelta(day=31)
for single_date in range(dateStart.day, dateStop.day + 1):
    df[str(single_date) + '.' + str(dateStart.month)] = 0
for index, row in df.iterrows():
    # Create a date range from start to finish with minute frequency
    # (both endpoints are included) and convert it to a DataFrame
    dateRangeFrame = pd.date_range(start=row['Start Time'], end=row['Stop Time'], freq='min').to_frame(name='value')
    # extract day.month without zero padding ('%#d.%#m' only works on Windows)
    dateRangeFrame['day'] = dateRangeFrame['value'].dt.day.astype(str) + '.' + dateRangeFrame['value'].dt.month.astype(str)
    # group by day and count the results -> now we have: per session (index) a day/minute object
    day_to_minute_df = dateRangeFrame.groupby(['day']).count()
    # for each group find column from index and add sum of val
    for d2m_index, d2m_row in day_to_minute_df.iterrows():
        df.loc[index, d2m_index] = d2m_row['value']
# sum only the numeric day columns per customer
new_df = df.groupby(['Customer']).sum(numeric_only=True)
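The loop above builds one row per minute of every session and then counts them. A lighter sketch of the same idea is to split each session at midnight and take time differences instead. It assumes, as the solution above does, that a session lasts from its first event to its last event. The helper `minutes_per_day` and the intermediate names `sessions`, `start`, `stop`, `day` and `usage` are hypothetical and not taken from the answers.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
        ('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH'),
    ],
    columns=['TimeStamp', 'Session ID', 'Event', 'Customer'],
)
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Assumption: a session runs from its first event to its last event
sessions = (
    df.groupby(['Customer', 'Session ID'])['TimeStamp']
    .agg(start='min', stop='max')
    .reset_index()
)


def minutes_per_day(session):
    """Hypothetical helper: split one session at midnight, return minutes per day."""
    first_day = session['start'].normalize()
    last_day = session['stop'].normalize()
    rows = []
    for day in pd.date_range(first_day, last_day, freq='D'):
        # Clip the session to the 24 hours of this calendar day
        seg_start = max(session['start'], day)
        seg_stop = min(session['stop'], day + pd.Timedelta(days=1))
        minutes = (seg_stop - seg_start).total_seconds() / 60
        rows.append({'Customer': session['Customer'],
                     'day': day.strftime('%d.%m'),
                     'minutes': minutes})
    return rows


per_day = pd.DataFrame(
    [row for _, session in sessions.iterrows() for row in minutes_per_day(session)]
)

# Sum all sessions per customer and day, with the days as columns
usage = per_day.pivot_table(index='Customer', columns='day',
                            values='minutes', aggfunc='sum', fill_value=0)
print(usage)
```

On the sample data this gives 720 and 30 minutes for A. Inc on 01.01 and 02.01. Unlike the counting approach, a session consisting of a single event contributes 0 minutes rather than 1.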