Arranging call data from Salesforce in 15-minute intervals
I'm new to Python and pandas, and new to Stack Overflow too, so I apologize in advance for any mistakes.
I have this DataFrame:
df = pd.DataFrame(
data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
['Donald Trump', 'German', '2021-9-23 14:58:01','2021-9-23 15:00:05', 124 ],
['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
df['interval_start'] = pd.to_datetime(df['interval_start'])
df['interval_end'] = pd.to_datetime(df['interval_end'])
The output is:
specialist language interval_start interval_end status_duration
0 Donald Trump German 2021-09-23 14:28:00 2021-09-23 14:58:00 1800
1 Donald Trump German 2021-09-23 14:58:01 2021-09-23 15:00:05 124
2 Donald Trump German 2021-09-24 10:05:00 2021-09-24 10:15:30 630
3 Monica Lewinsky German 2021-09-24 10:05:00 2021-09-24 10:05:30 30
The result I want looks something like this:
specialist language interval status_duration
0 Donald Trump German 2021-9-23 14:15:00 120
1 Donald Trump German 2021-9-23 14:30:00 900
2 Donald Trump German 2021-9-23 14:45:00 899
3 Donald Trump German 2021-9-23 15:00:00 5
4 Donald Trump German 2021-9-24 10:00:00 600
5 Donald Trump German 2021-9-24 10:15:00 30
6 Monica Lewinsky German 2021-9-24 10:00:00 30
I have this code from another thread:
ref = (df.groupby(["specialist", "Language", pd.Grouper(key="Interval Start", freq="D")], as_index=False)
.agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]),
Interval=("Interval Start", "first"))
.explode("status_duration"))
ref["Interval"] = ref["Interval"].dt.floor("15min")+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")
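But it doesn't take interval_start into account; I first need to check whether the status_duration stays within the same 15-minute interval. I hope someone can help, because this is a very advanced problem for me and I have been working on it for more than 10 days.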
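Not sure whether this is unnecessarily convoluted, but it does get the job done. There may well be a better, more pythonic way...
I first added a few new columns to df: the number of result intervals that status_duration alone suggests, the number of seconds that fit into the first interval, and the remaining duration: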
df['len'] = 1 + (df['status_duration']-1)//900
df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
Then we add one extra interval for every row that has a positive rest and a first slice shorter than 900 seconds:
df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
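The first/rest split above can be tried on a single timestamp. This is a minimal sketch, not the code from the answer: the helper name first_and_rest is made up for illustration, and unlike the column version it clamps the first slice to the call duration so that the rest never goes negative.

```python
import pandas as pd

def first_and_rest(start, duration, width="15min"):
    """Split a duration in seconds into the part that fits before the next
    bucket boundary and the remainder (hypothetical helper)."""
    start = pd.Timestamp(start)
    # Adding one second before ceil() means a start that sits exactly on a
    # boundary gets a full bucket of room instead of zero seconds.
    boundary = (start + pd.Timedelta(seconds=1)).ceil(width)
    room = int((boundary - start).total_seconds())
    first = min(room, duration)
    return first, duration - first

print(first_and_rest("2021-09-23 14:28:00", 1800))  # (120, 1680)
print(first_and_rest("2021-09-24 10:05:00", 30))    # (30, 0)
```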
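Now I create the new DataFrame by repeating the rows with np.repeat() according to the number of intervals, and by using list comprehensions over df.iterrows() to build the interval_start and status_duration columns with the right values: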
new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
'language': np.repeat(df['language'], df['len']),
'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
})
Then we floor the interval start times to the 15-minute boundary:
new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
All that is left to do now is group and reset the index:
new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
Result:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 899
3 Donald Trump German 2021-09-23 15:00:00 5
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
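One problem remains: the final grouping step can produce 15-minute intervals whose summed status_duration is again greater than 900.
Suppose the second row of your input data had an interval_start 2 seconds earlier: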
specialist language interval_start interval_end status_duration
0 Donald Trump German 2021-09-23 14:28:00 2021-09-23 14:58:00 1800
1 Donald Trump German 2021-09-23 14:57:59 2021-09-23 15:00:03 124
2 Donald Trump German 2021-09-24 10:05:00 2021-09-24 10:15:30 630
3 Monica Lewinsky German 2021-09-24 10:05:00 2021-09-24 10:05:30 30
Then after grouping you would get a status_duration of 901:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 901
3 Donald Trump German 2021-09-23 15:00:00 3
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
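Because this "overflow" can happen more than once, things get complicated. One approach is to repeat the steps above until no row of new_df has a status_duration > 900. This carries the overflow forward into the following interval.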
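The carry-forward idea can also be shown on its own, separate from the slicing. The sketch below is only an illustration under assumptions: it works on the bucket totals of a single specialist, and the function name carry_overflow and its cap parameter are invented here, not part of the answer's code.

```python
import pandas as pd

def carry_overflow(totals, cap=900, freq="15min"):
    """Push anything above `cap` seconds into the following bucket.
    `totals` is a Series of seconds indexed by bucket start (one specialist)."""
    step = pd.Timedelta(freq)
    pending = totals.to_dict()
    result = {}
    while pending:
        bucket = min(pending)            # always handle the earliest bucket first
        seconds = pending.pop(bucket)
        if seconds > cap:
            nxt = bucket + step
            pending[nxt] = pending.get(nxt, 0) + (seconds - cap)
            seconds = cap
        result[bucket] = seconds
    return pd.Series(result).sort_index()

totals = pd.Series(
    [120, 900, 901, 3],
    index=pd.to_datetime(["2021-09-23 14:15", "2021-09-23 14:30",
                          "2021-09-23 14:45", "2021-09-23 15:00"]),
)
fixed = carry_overflow(totals)
print(fixed)
```

With the 901-second bucket from the table above, the extra second moves from 14:45 to 15:00, and the total number of seconds is unchanged.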
Full example:
import pandas as pd
import numpy as np
from datetime import timedelta
input_df = pd.DataFrame(
data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ],
['Donald Trump', 'German', '2021-9-23 14:57:59','2021-9-23 15:00:03', 124 ],
['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ],
['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]],
columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
input_df['interval_start'] = pd.to_datetime(input_df['interval_start'])
input_df['interval_end'] = pd.to_datetime(input_df['interval_end'])
def build_df(df):
    # Repeat the slicing until no 15-minute bucket holds more than 900 seconds
    while df['status_duration'].gt(900).any():
        df['len'] = 1 + (df['status_duration']-1)//900
        df['first'] = ((df['interval_start']+timedelta(seconds=1)).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
        df['rest'] = df['status_duration'] - df['first']
        df['len'] = np.where((df['rest'] > 0) & (df['first'] < 900), df['len'] + 1, df['len'])
        new_df = pd.DataFrame({'specialist': np.repeat(df['specialist'], df['len']),
                               'language': np.repeat(df['language'], df['len']),
                               'interval_start': [el for sublist in [[x['interval_start'] + timedelta(minutes=15*y) for y in range(0, x['len'])] if (x['len'] > 1) else [x['interval_start']] for i, x in df.iterrows()] for el in sublist],
                               'status_duration': [el for sublist in [([x['first']]+[900]*(x['len']-2)+[x['rest']%900]) if x['len'] > 1 else [x['status_duration']] for i, x in df.iterrows()] for el in sublist]
                               })
        new_df['interval_start'] = new_df['interval_start'].dt.floor('15min')
        # Drop empty slices, then merge slices that landed in the same bucket
        new_df = new_df[new_df.status_duration != 0]
        new_df = new_df.groupby(['specialist', 'language', 'interval_start']).sum().reset_index()
        df = new_df.copy()
    return df

output_df = build_df(input_df)
Result:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 900
3 Donald Trump German 2021-09-23 15:00:00 4
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
Looking at it now, I suspect there should be a simpler way, but this is what I came up with...
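One simpler route might be to ignore status_duration and compute, for each call, how many seconds of it overlap each 15-minute bucket. The sketch below is a different technique from the answer above, and it rests on an assumption: that interval_end minus interval_start equals status_duration, as it does in the sample data. The function name split_into_buckets is made up for illustration.

```python
import pandas as pd

def split_into_buckets(df, freq="15min"):
    """Sum, per specialist and bucket, the seconds each call overlaps the bucket."""
    step = pd.Timedelta(freq)
    rows = []
    for rec in df.itertuples(index=False):
        bucket = rec.interval_start.floor(freq)
        while bucket < rec.interval_end:
            # Overlap of [interval_start, interval_end) with [bucket, bucket + step)
            overlap_start = max(rec.interval_start, bucket)
            overlap_end = min(rec.interval_end, bucket + step)
            seconds = int((overlap_end - overlap_start).total_seconds())
            if seconds > 0:
                rows.append((rec.specialist, rec.language, bucket, seconds))
            bucket += step
    out = pd.DataFrame(rows, columns=["specialist", "language",
                                      "interval_start", "status_duration"])
    keys = ["specialist", "language", "interval_start"]
    return out.groupby(keys, as_index=False)["status_duration"].sum()

calls = pd.DataFrame(
    data=[["Donald Trump", "German", "2021-9-23 14:28:00", "2021-9-23 14:58:00", 1800],
          ["Donald Trump", "German", "2021-9-23 14:58:01", "2021-9-23 15:00:05", 124],
          ["Donald Trump", "German", "2021-9-24 10:05:00", "2021-9-24 10:15:30", 630],
          ["Monica Lewinsky", "German", "2021-9-24 10:05:00", "2021-9-24 10:05:30", 30]],
    columns=["specialist", "language", "interval_start", "interval_end", "status_duration"],
)
calls["interval_start"] = pd.to_datetime(calls["interval_start"])
calls["interval_end"] = pd.to_datetime(calls["interval_end"])

buckets = split_into_buckets(calls)
print(buckets)
```

As long as a specialist's calls do not overlap each other, no bucket can exceed 900 seconds with this approach, so no repeat loop is needed.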
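After learning some more, I came up with another (better) solution using groupby() and explode(). I am adding it as a second answer because my first answer, although perhaps a bit convoluted, still works, and I also reuse part of it in this answer.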
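I first added a few new columns that split status_duration into the first slice and the rest, and replaced the original value of status_duration with the corresponding 2-element list: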
df['first'] = ((df['interval_start']+ pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']
df['status_duration'] = df[['first','rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])
This gives you the following prepared DataFrame:
specialist language interval_start ... status_duration first rest
0 Donald Trump German 2021-09-23 14:28:00 ... [120, 1680] 120 1680
1 Donald Trump German 2021-09-23 14:58:01 ... [119, 5] 119 5
2 Donald Trump German 2021-09-24 10:05:00 ... [600, 30] 600 30
3 Monica Lewinsky German 2021-09-24 10:05:00 ... [30, 0] 600 -570
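Based on that, you can now do a groupby() and explode() similar to the code in your question. Afterwards you floor the intervals and group again to merge the intervals that now have several entries because of explode(). To clean up, I drop the rows with a duration of 0 and reset the index: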
ref = (df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='min')], as_index=False)
       .agg(status_duration=('status_duration', lambda d: [d.iat[0][0],*([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),interval_start=('interval_start', 'first'))
       .explode('status_duration'))
ref['interval_start'] = ref['interval_start'].dt.floor('15min')+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')
ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
ref = ref[ref.status_duration != 0].reset_index()
This gives you the desired output:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 899
3 Donald Trump German 2021-09-23 15:00:00 5
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
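To see what explode() and cumcount() contribute in the code above, here is a small sketch on a single hand-built row. The frame name demo and its list of slices are made up for illustration; the slices are the ones the first sample call (1800 seconds starting at 14:28:00) produces.

```python
import pandas as pd

demo = pd.DataFrame({
    "interval_start": pd.to_datetime(["2021-09-23 14:28:00"]),
    "status_duration": [[120, 900, 780]],   # first slice, one full bucket, remainder
})

# explode() turns each list element into its own row and repeats the index label
exploded = demo.explode("status_duration")

# cumcount() numbers the rows that share an index label: 0, 1, 2
offset = exploded.groupby(exploded.index).cumcount() * 900

# Floor to the bucket start, then shift each slice by its position
exploded["interval_start"] = (
    exploded["interval_start"].dt.floor("15min") + pd.to_timedelta(offset, unit="s")
)
print(exploded)
```

Each slice ends up in its own consecutive 15-minute bucket, which is why the later groupby() only has to add up slices from different calls that share a bucket.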
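Note: the problem I described in my other answer, namely that the final grouping step can lead to a status_duration > 900, should not be possible with real data, because a specialist should not be able to start a second interval before the first one has ended. So this is a case you don't need to handle after all.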
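If you would rather verify that assumption than rely on it, a quick check along these lines might help. It is only a sketch: the function name find_overlaps is hypothetical, and it compares each call only with the same specialist's immediately preceding call.

```python
import pandas as pd

def find_overlaps(df):
    """Return the calls that start before the same specialist's previous call ended."""
    ordered = df.sort_values(["specialist", "interval_start"])
    prev_end = ordered.groupby("specialist")["interval_end"].shift()
    # Comparisons against NaT (first call of each specialist) evaluate to False
    return ordered[ordered["interval_start"] < prev_end]

calls = pd.DataFrame({
    "specialist": ["Donald Trump", "Donald Trump", "Donald Trump", "Monica Lewinsky"],
    "interval_start": pd.to_datetime(["2021-09-23 14:28:00", "2021-09-23 14:57:59",
                                      "2021-09-24 10:05:00", "2021-09-24 10:05:00"]),
    "interval_end": pd.to_datetime(["2021-09-23 14:58:00", "2021-09-23 15:00:03",
                                    "2021-09-24 10:15:30", "2021-09-24 10:05:30"]),
})

overlaps = find_overlaps(calls)
print(overlaps)
```

On the modified sample from my other answer this flags the call starting at 14:57:59, which is the row that caused the 901-second bucket.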