Calculate usage time per day and customer in Pandas

Question

I have a Pandas DataFrame containing the events that customers send each month, like this:

import pandas as pd

df = pd.DataFrame(
    [
        ('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
        ('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH')
    ],
    columns=['TimeStamp', 'Session ID', 'Event', 'Customer']
)

Session IDs are unique, but a session can span multiple days. In addition, several sessions can occur on the same day.

I would like to calculate the minutes of usage per customer for each day of the month, like this:

Customer	01.01	02.01	...	31.01
A. Inc	720	30	...	50
B. GmbH	1	0	...	0

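I suspect the way to go is to split the timestamp into date and time, then groupby('Customer', 'Day', 'Session ID'), then apply some math via apply(), but so far I haven't been able to make any real progress.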

Answer 1

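You can try this. Extract the date and the time of day (in minutes since midnight) into new columns. Then use groupby and agg to sum the minutes per customer and date. Finally, pivot the DataFrame.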

# parse the timestamp strings into datetimes
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
df['date'] = df['TimeStamp'].dt.date
# time of day of each event, in minutes since midnight
df['minutes'] = df['TimeStamp'].dt.hour * 60 + df['TimeStamp'].dt.minute

# sum the minutes per customer and date, then pivot the dates into columns
new_df = df.groupby(['Customer', 'date']).agg({'minutes': 'sum'}).reset_index()
print(pd.pivot_table(new_df, values='minutes', index=['Customer'], columns='date'))

Output:

date      2017-01-01  2017-01-02
Customer                        
A. Inc         720.0       750.0
B. GmbH        900.0         NaN
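
Note that the minutes column holds the clock time of each event, so the table above sums clock times per day rather than elapsed time. If elapsed time per session is what is needed, one possible starting point is sketched below. It assumes that a session runs from its first to its last event, which the question does not state explicitly; the names sessions and duration_min are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
        ('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH'),
    ],
    columns=['TimeStamp', 'Session ID', 'Event', 'Customer'],
)
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Assumption: the first and last event of a session mark its start and end.
sessions = (
    df.groupby(['Customer', 'Session ID'])['TimeStamp']
    .agg(start='min', end='max')
    .reset_index()
)

# Elapsed minutes between the first and last event of each session.
sessions['duration_min'] = (sessions['end'] - sessions['start']).dt.total_seconds() / 60

# True when a session starts and ends on different calendar days.
sessions['crosses_midnight'] = sessions['start'].dt.date != sessions['end'].dt.date

print(sessions)
```

Sessions with a single event get a duration of zero here, and the duration is not yet split across days.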

Answer 2

OK, I found a solution. It may not be the best one, but it works.

import datetime

import dateutil.relativedelta
import pandas as pd

# make sure the timestamps are datetimes, not strings
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# group by session and add the min and max timestamp of each group as new columns
group_session = df.groupby('Session ID')
df['Start Time'] = group_session['TimeStamp'].transform('min')
df['Stop Time'] = group_session['TimeStamp'].transform('max')
df.drop_duplicates(subset=['Session ID'], keep='first', inplace=True)
# now we have start/stop for each session

# add one column per day of the month and fill it with zeros
date_start = datetime.datetime(2017, 1, 1)  # first day of the month to report (January 2017 in the sample data)
date_stop = date_start + dateutil.relativedelta.relativedelta(day=31)  # last day of that month
day_columns = [f'{day}.{date_start.month}' for day in range(1, date_stop.day + 1)]
for col in day_columns:
    df[col] = 0

for index, session in df.iterrows():
    # create a date range from start to finish with minute frequency and convert it to a DataFrame
    date_range_frame = pd.date_range(start=session['Start Time'], end=session['Stop Time'], freq='min').to_frame(name='value')
    # build a 'day.month' label for every minute (portable replacement for strftime('%#d.%#m'), which only works on Windows)
    date_range_frame['day'] = date_range_frame['value'].dt.day.astype(str) + '.' + date_range_frame['value'].dt.month.astype(str)
    # group by day and count the rows -> minutes per day for this session
    day_to_minute_df = date_range_frame.groupby('day').count()
    # write each day's minute count into the matching day column
    for day_label, minutes in day_to_minute_df['value'].items():
        df.loc[index, day_label] = minutes

# sum only the day columns per customer (the datetime columns cannot be summed)
new_df = df.groupby('Customer')[day_columns].sum()
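
The loop above builds a minute-by-minute range for every session, which can get slow for long sessions. A lighter variant might split each session at midnight and take the length of each piece instead. The following is only a sketch under the same assumption (a session runs from its first to its last event); minutes_per_day and usage are hypothetical names, and unlike the counting approach a single-event session contributes 0 minutes rather than 1.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('2017-01-01 12:00:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 00:30:00', 'SID1', 'Something', 'A. Inc'),
        ('2017-01-02 12:00:00', 'SID2', 'Something', 'A. Inc'),
        ('2017-01-01 15:00:00', 'SID4', 'Something', 'B. GmbH'),
    ],
    columns=['TimeStamp', 'Session ID', 'Event', 'Customer'],
)
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])


def minutes_per_day(start, stop):
    """Split the interval from start to stop at midnight; return {'dd.mm': minutes}."""
    result = {}
    for day in pd.date_range(start.normalize(), stop.normalize(), freq='D'):
        # Clip the session to the current calendar day.
        seg_start = max(start, day)
        seg_stop = min(stop, day + pd.Timedelta(days=1))
        result[day.strftime('%d.%m')] = (seg_stop - seg_start).total_seconds() / 60
    return result


rows = []
for (customer, session_id), stamps in df.groupby(['Customer', 'Session ID'])['TimeStamp']:
    # Assumption: first and last event mark the start and end of the session.
    for day, minutes in minutes_per_day(stamps.min(), stamps.max()).items():
        rows.append({'Customer': customer, 'day': day, 'minutes': minutes})

# One row per customer, one column per day, minutes summed over all sessions.
usage = pd.DataFrame(rows).pivot_table(
    index='Customer', columns='day', values='minutes', aggfunc='sum', fill_value=0
)
print(usage)
```

For the sample data this gives 720 minutes on 01.01 and 30 minutes on 02.01 for A. Inc, matching the first two columns of the table in the question. The 'dd.mm' labels only stay unambiguous within a single year.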

