Pandas aggregate with a self-written function: optimisation issue
The following code does exactly what I need, but it is very slow on a large amount of data (up to 100 000).
How can I improve it?
import pandas as pd

df = pd.DataFrame({
    "session": ["s1", "s1", "s1", "s1", "s2", "s2", "s2"],
    "sub session": ["a", "b", "d", "g", "f", "a", "x"],
    "time": ["2022-01-04 10:00:00", "2022-01-04 10:01:00", "2022-01-04 10:10:00", "2022-01-04 10:12:00",
             "2022-01-04 15:25:00", "2022-01-04 15:30:00", "2022-01-04 15:45:00"]
})
print(df)
  session sub session                 time
0      s1           a  2022-01-04 10:00:00
1      s1           b  2022-01-04 10:01:00
2      s1           d  2022-01-04 10:10:00
3      s1           g  2022-01-04 10:12:00
4      s2           f  2022-01-04 15:25:00
5      s2           a  2022-01-04 15:30:00
6      s2           x  2022-01-04 15:45:00
def func(serie):
    arr = serie.to_list()
    t0 = pd.to_datetime(str(arr[0]))
    return [(pd.to_datetime(str(i))-t0).total_seconds()/60 for i in arr]

res = df.groupby(['session']).agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", 'count'),
    session_time=("time", func))
print(res)
        sub_session_path  path_length            session_time
session
s1          [a, b, d, g]            4  [0.0, 1.0, 10.0, 12.0]
s2             [f, a, x]            3        [0.0, 5.0, 20.0]
IIUC, convert the time column to datetime only once, up front, and use vectorized code inside the function:
# Parse the strings once for the whole column instead of once per value
df['time'] = pd.to_datetime(df['time'])

def func(s):
    # Minutes elapsed since the first timestamp of the group
    return (s-s.iloc[0]).dt.total_seconds().div(60).round(2).to_list()

res = df.groupby(['session']).agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", 'count'),
    session_time=("time", func))
Output:
        sub_session_path  path_length            session_time
session
s1          [a, b, d, g]            4  [0.0, 1.0, 10.0, 12.0]
s2             [f, a, x]            3        [0.0, 5.0, 20.0]
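If the per-group Python function is still a bottleneck, one possible variation (a sketch, not part of the answer above) is to compute the elapsed minutes for the whole column before aggregating, so that only built-in aggregations run per group. It assumes that rows within each session are already in chronological order and that `time` has no missing values; the helper column name `elapsed_min` is made up for illustration. Whether this is actually faster on your data would need to be measured.

```python
import pandas as pd

df = pd.DataFrame({
    "session": ["s1", "s1", "s1", "s1", "s2", "s2", "s2"],
    "sub session": ["a", "b", "d", "g", "f", "a", "x"],
    "time": ["2022-01-04 10:00:00", "2022-01-04 10:01:00",
             "2022-01-04 10:10:00", "2022-01-04 10:12:00",
             "2022-01-04 15:25:00", "2022-01-04 15:30:00",
             "2022-01-04 15:45:00"],
})
df["time"] = pd.to_datetime(df["time"])

# First timestamp of each session, broadcast back onto every row of that session
first = df.groupby("session")["time"].transform("first")

# Elapsed minutes for all rows at once ("elapsed_min" is a hypothetical helper column)
df["elapsed_min"] = (df["time"] - first).dt.total_seconds().div(60).round(2)

# Only built-in aggregations remain in the groupby
res = df.groupby("session").agg(
    sub_session_path=("sub session", list),
    path_length=("sub session", "count"),
    session_time=("elapsed_min", list),
)
print(res)
```

Note that `transform("first")` takes the first non-null value of each group, whereas `s.iloc[0]` takes the first row regardless, so the two versions can differ if a session starts with a missing timestamp.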