Arranging call data from Salesforce in 15-minute intervals

I'm new to Python and pandas, and new to Stack Overflow as well, so I apologize in advance for any mistakes.

I have this DataFrame:

import pandas as pd

df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
          ['Donald Trump', 'German', '2021-9-23 14:58:01','2021-9-23 15:00:05', 124 ],
          ['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
df['interval_start'] = pd.to_datetime(df['interval_start'])
df['interval_end'] = pd.to_datetime(df['interval_end'])

The output is:

        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:58:01 2021-09-23 15:00:05              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30

The result I want looks something like this:

        specialist language            interval  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

I have this code from another thread:

ref = (df.groupby(["specialist", "Language", pd.Grouper(key="Interval Start", freq="D")], as_index=False)
         .agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]),
              Interval=("Interval Start", "first"))
         .explode("status_duration"))

ref["Interval"] = ref["Interval"].dt.floor("15min")+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")

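But it doesn't take interval_start into account; I first need to check whether status_duration stays within the same 15-minute interval. I hope someone can help, because this is a very advanced problem for me and I have been working on it for more than 10 days.

Not sure whether this is unnecessarily convoluted, but it gets the job done. There may well be a better, more pythonic way...

I first added some new columns to df: the number of resulting intervals implied by status_duration (len), the number of seconds that fit into the first interval (first), and the remaining duration (rest):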

df['len'] = 1 + (df['status_duration']-1)//900

df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)

df['rest'] = df['status_duration'] - df['first']
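The one-second shift before ceil('15min') matters for calls that start exactly on a quarter-hour boundary. A minimal sketch on three made-up timestamps (the variable names are illustrative and not part of the answer):

```python
import pandas as pd

# Hypothetical start times: mid-interval, just before a boundary, exactly on a boundary.
starts = pd.Series(pd.to_datetime([
    "2021-09-23 14:28:00",
    "2021-09-23 14:58:01",
    "2021-09-23 14:30:00",
]))

# Without the shift, a start on a boundary ceils to itself and leaves 0 seconds of room.
naive = (starts.dt.ceil("15min") - starts).dt.total_seconds().astype(int)

# With the shift, the same start gets the full 900 seconds of its own interval.
shifted = ((starts + pd.Timedelta(seconds=1)).dt.ceil("15min") - starts).dt.total_seconds().astype(int)

print(naive.tolist())    # [120, 119, 0]
print(shifted.tolist())  # [120, 119, 900]
```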

Then we add one extra interval for every row that has a positive rest and a first slice < 900:

df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])

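Now I create the new DataFrame: np.repeat() replicates each row according to its number of intervals, and list comprehensions over df.iterrows() build the interval_start and status_duration columns with the right values: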

new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                 'language': np.repeat(df['language'], df['len']),
                 'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                 'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})

Then we round the interval start times down to the 15-minute mark:

new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')

All that is left now is to group and reset the index:

new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()

Result:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30
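A quick sanity check for a split like this is that no seconds are gained or lost per specialist. The sketch below hardcodes the durations from the input and from the result table; the frame names original and binned are only illustrative.

```python
import pandas as pd

# Durations per call, copied from the input data.
original = pd.DataFrame({
    "specialist": ["Donald Trump"] * 3 + ["Monica Lewinsky"],
    "status_duration": [1800, 124, 630, 30],
})

# Durations per 15-minute interval, copied from the result table.
binned = pd.DataFrame({
    "specialist": ["Donald Trump"] * 6 + ["Monica Lewinsky"],
    "status_duration": [120, 900, 899, 5, 600, 30, 30],
})

before = original.groupby("specialist")["status_duration"].sum()
after = binned.groupby("specialist")["status_duration"].sum()

# Totals per specialist should match exactly.
print(before.equals(after))  # True
```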

One problem remains: the final grouping step can again produce 15-minute intervals with status_duration > 900.

Suppose the second row of your input data had an interval_start 2 seconds earlier:

        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:57:59 2021-09-23 15:00:03              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630 
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30 

Then after grouping you get a status_duration of 901:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              901
3     Donald Trump   German 2021-09-23 15:00:00                3
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

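Since this kind of "overflow" can happen more than once, things get complicated. One way to handle it is to repeat the steps above until no row with status_duration > 900 is left in new_df; each pass carries the overflow into the following interval.

Full example: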
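The idea of carrying overflow forward can also be shown in isolation. The following is only a sketch for a single specialist: carry_overflow is a hypothetical helper, not part of the answer, and it assumes the input is a Series of seconds indexed by interval start.

```python
import pandas as pd

def carry_overflow(seconds: pd.Series, cap: int = 900, freq: str = "15min") -> pd.Series:
    """Cap every interval at `cap` seconds and push the excess into the next interval."""
    step = pd.Timedelta(freq)
    pending = seconds.sort_index().to_dict()
    result = {}
    while pending:
        start = min(pending)              # always process the earliest interval first
        value = pending.pop(start)
        result[start] = min(value, cap)
        if value > cap:
            nxt = start + step            # the next interval may not exist yet
            pending[nxt] = pending.get(nxt, 0) + (value - cap)
    return pd.Series(result, name=seconds.name).sort_index()

# The grouped values from the overflow example (one specialist only).
idx = pd.to_datetime([
    "2021-09-23 14:15:00", "2021-09-23 14:30:00",
    "2021-09-23 14:45:00", "2021-09-23 15:00:00",
])
grouped = pd.Series([120, 900, 901, 3], index=idx, name="status_duration")

fixed = carry_overflow(grouped)
print(fixed.tolist())  # [120, 900, 900, 4]
```

With several specialists, the same helper could be applied per group, for example inside a loop over groupby(['specialist', 'language']).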

import pandas as pd
import numpy as np
from datetime import timedelta

input_df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
          ['Donald Trump', 'German', '2021-9-23 14:57:59','2021-9-23 15:00:03', 124 ],
          ['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])

def build_df(df):
    df = df.copy()  # work on a copy so the caller's frame is not modified
    while True:
        # number of intervals implied by the duration alone
        df['len'] = 1 + (df['status_duration']-1)//900
        # seconds that still fit into the first interval, and what is left after that
        df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
        df['rest'] = df['status_duration'] - df['first']
        df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
        new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                 'language': np.repeat(df['language'], df['len']),
                 'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                 'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
        })
        new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
        new_df = new_df[new_df.status_duration != 0]
        new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
        df = new_df
        # run at least once, then repeat only while some interval still exceeds 900 seconds
        if not df['status_duration'].gt(900).any():
            return df

output_df = build_df(input_df)

Result:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              900
3     Donald Trump   German 2021-09-23 15:00:00                4
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

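Looking at it now, I suspect there should be a simpler way, but this is what I came up with...

After learning more, I came up with another (better) solution using groupby() and explode(). I am adding it as a second answer because my first answer, though perhaps a bit complicated, still works, and I also refer to part of it in this answer.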
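One candidate for a simpler route is to split each call at the actual quarter-hour boundaries using interval_start and interval_end. This is only a sketch of a different technique than the code above: split_into_bins is a hypothetical helper, and it assumes that status_duration always equals interval_end minus interval_start, which holds for the sample rows but may not hold for real Salesforce exports.

```python
import pandas as pd

def split_into_bins(df: pd.DataFrame, freq: str = "15min") -> pd.DataFrame:
    """Split every call at the bin boundaries and sum the seconds per specialist and bin."""
    width = pd.Timedelta(freq)
    rows = []
    for rec in df.itertuples(index=False):
        # Left edges of every bin the call could touch.
        edges = pd.date_range(rec.interval_start.floor(freq), rec.interval_end, freq=freq)
        for left in edges:
            overlap = min(rec.interval_end, left + width) - max(rec.interval_start, left)
            seconds = int(overlap.total_seconds())
            if seconds > 0:  # skip bins the call only touches at the edge
                rows.append((rec.specialist, rec.language, left, seconds))
    out = pd.DataFrame(rows, columns=["specialist", "language", "interval_start", "status_duration"])
    keys = ["specialist", "language", "interval_start"]
    return out.groupby(keys, as_index=False)["status_duration"].sum()

calls = pd.DataFrame(
    data=[["Donald Trump", "German", "2021-9-23 14:28:00", "2021-9-23 14:58:00", 1800],
          ["Donald Trump", "German", "2021-9-23 14:58:01", "2021-9-23 15:00:05", 124],
          ["Donald Trump", "German", "2021-9-24 10:05:00", "2021-9-24 10:15:30", 630],
          ["Monica Lewinsky", "German", "2021-9-24 10:05:00", "2021-9-24 10:05:30", 30]],
    columns=["specialist", "language", "interval_start", "interval_end", "status_duration"],
)
calls["interval_start"] = pd.to_datetime(calls["interval_start"])
calls["interval_end"] = pd.to_datetime(calls["interval_end"])

binned = split_into_bins(calls)
print(binned["status_duration"].tolist())  # [120, 900, 899, 5, 600, 30, 30]
```

Overlapping calls from the same specialist would still add up to more than 900 seconds per bin here, so this sketch does not replace the overflow handling.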


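I first added some new columns that split status_duration into a first slice and the rest, and replaced the original status_duration values with the corresponding 2-element lists: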

df['first'] = ((df['interval_start']+ pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first','rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])

This gives you the following prepared DataFrame:

        specialist language      interval_start  ... status_duration first  rest
0     Donald Trump   German 2021-09-23 14:28:00  ...     [120, 1680]   120  1680
1     Donald Trump   German 2021-09-23 14:58:01  ...        [119, 5]   119     5
2     Donald Trump   German 2021-09-24 10:05:00  ...       [600, 30]   600    30
3  Monica Lewinsky   German 2021-09-24 10:05:00  ...         [30, 0]   600  -570

Building on this, you can now do a groupby() and explode() similar to the code in your question. Afterwards you floor the intervals and group again to merge the intervals that now have multiple entries because of explode(). To clean up, I drop the rows with a duration of 0 and reset the index:

ref = (df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='min')], as_index=False)
         .agg(status_duration=('status_duration', lambda d: [d.iat[0][0],*([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),
              interval_start=('interval_start', 'first'))
         .explode('status_duration'))

ref['interval_start'] = ref['interval_start'].dt.floor('15min')+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')

ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
ref = ref[ref.status_duration != 0].reset_index()
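The explode() and cumcount() mechanics are easier to follow on a tiny frame. A minimal sketch with two made-up calls; slices is a hypothetical helper that builds the same kind of list as the lambda above:

```python
import pandas as pd

def slices(first, rest, size=900):
    """[first slice, full 900-second chunks..., remainder]"""
    return [first, *([size] * (rest // size)), rest % size]

demo = pd.DataFrame({
    "interval_start": pd.to_datetime(["2021-09-23 14:28:00", "2021-09-24 10:05:00"]),
    "status_duration": [slices(120, 1680), slices(30, 0)],
})

# explode() turns each list element into its own row and repeats the original index.
exploded = demo.explode("status_duration")

# cumcount() per original index numbers the slices 0, 1, 2, ... within each call.
slot = exploded.groupby(level=0).cumcount()

# Slice n belongs to the interval n * 900 seconds after the first (floored) interval.
exploded["interval_start"] = exploded["interval_start"].dt.floor("15min") + pd.to_timedelta(slot * 900, unit="s")

print(exploded["status_duration"].tolist())  # [120, 900, 780, 30, 0]
print(slot.tolist())                         # [0, 1, 2, 0, 1]
```

The trailing 0 for the second call is the kind of row that the final status_duration != 0 filter removes.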

This gives you the desired output:

        specialist language      interval_start  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30

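Note: the problem I described in my other answer, namely that the final grouping step can produce status_duration > 900, should be impossible with real data, because a specialist should not be able to start a second interval before the first one has ended. So this is a case you don't need to handle after all.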
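Whether real data actually satisfies this can be checked before binning. A minimal sketch, using the modified input from the first answer; the column name overlaps_previous is made up for illustration, and the check only compares each call with the one directly before it for the same specialist:

```python
import pandas as pd

calls = pd.DataFrame(
    data=[["Donald Trump", "2021-9-23 14:28:00", "2021-9-23 14:58:00"],
          ["Donald Trump", "2021-9-23 14:57:59", "2021-9-23 15:00:03"],
          ["Donald Trump", "2021-9-24 10:05:00", "2021-9-24 10:15:30"],
          ["Monica Lewinsky", "2021-9-24 10:05:00", "2021-9-24 10:05:30"]],
    columns=["specialist", "interval_start", "interval_end"],
)
calls["interval_start"] = pd.to_datetime(calls["interval_start"])
calls["interval_end"] = pd.to_datetime(calls["interval_end"])

calls = calls.sort_values(["specialist", "interval_start"])

# End time of the previous call of the same specialist (NaT for the first call).
previous_end = calls.groupby("specialist")["interval_end"].shift()

# A call overlaps if it starts before the previous one ended; NaT compares as False.
calls["overlaps_previous"] = calls["interval_start"] < previous_end

print(calls["overlaps_previous"].tolist())  # [False, True, False, False]
```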