pandas: fast custom aggregation
I have time-indexed data that I need to resample:
interval = pd.Timedelta(1/8, "s")
resampled_df = df[["T", "N"]].resample(interval).max()
This runs very fast, but I need a custom aggregation function (the extreme value) instead of max:
def extreme_agg(array_like):
    # return the max or the min, whichever has the greater absolute value
    return max(array_like.max(), array_like.min(), key=abs)
interval = pd.Timedelta(1/8, "s")
resampled_df = df[["T", "N"]].resample(interval).apply(extreme_agg)
I also tried:
resampled_df = df[["T", "N"]].resample(interval).agg(extreme_agg)
but both approaches are very slow.
Do you know how to make this faster, or is there a fast equivalent of my extreme_agg?
You can change this function to work on the aggregated values instead: first aggregate each bin by min and max, then select the minimal and maximal values with DataFrame.xs:
import numpy as np
import pandas as pd

np.random.seed(2021)
N = 10000
df = pd.DataFrame({'T': np.random.randint(100, size=N),
                   'N': np.random.randint(100, size=N)},
                  index=pd.timedelta_range(0, freq='100ms', periods=N)).sub(50)
# print (df)

def npwhere(df):
    interval = pd.Timedelta(1/8, "s")
    # aggregate max and min per bin; the columns become a (column, aggregation) MultiIndex
    resampled_df = df[["T", "N"]].resample(interval).agg(['max','min'])
    amax = resampled_df.xs('max', axis=1, level=1)
    amin = resampled_df.xs('min', axis=1, level=1)
    # take the min where -min > max (the min is further from zero), otherwise the max
    return pd.DataFrame(np.where(-amin > amax, amin, amax),
                        index=resampled_df.index,
                        columns=['T','N'])

resampled_df = npwhere(df)
print (resampled_df.head(10))

def extreme_agg(array_like):
    # return the max or the min, whichever has the greater absolute value
    return max(array_like.max(), array_like.min(), key=abs)

interval = pd.Timedelta(1/8, "s")
resampled_df1 = df[["T", "N"]].resample(interval).agg(extreme_agg)
print (resampled_df1.head(10))

print (resampled_df.equals(resampled_df1))
True
In [206]: %timeit npwhere(df)
12.4 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [207]: %timeit df[["T", "N"]].resample(interval).agg(lambda x: max(x, key = abs))
306 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [208]: %timeit df[["T", "N"]].resample(interval).agg(extreme_agg)
2.29 s ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
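The same min/max comparison can also be written with DataFrame.where, which avoids hard-coding the column names. This is only a sketch under assumptions: the helper name extreme_resample, the 250 ms interval, and the small demo frame are made up for illustration and were not benchmarked here.

```python
import pandas as pd

def extreme_resample(frame, interval):
    """Per bin, return whichever of min/max has the larger absolute value."""
    grouped = frame.resample(interval)
    bin_max = grouped.max()
    bin_min = grouped.min()
    # Keep the max unless the min is further from zero; a tie keeps the max,
    # matching max(a.max(), a.min(), key=abs).
    return bin_max.where(bin_max >= -bin_min, bin_min)

# Hypothetical demo data: six samples, 100 ms apart.
idx = pd.timedelta_range(0, freq="100ms", periods=6)
demo = pd.DataFrame({"T": [1, -5, 3, 2, -2, 4],
                     "N": [-1, 0, 7, -9, 8, 1]}, index=idx)
result = extreme_resample(demo, pd.Timedelta(250, "ms"))
print(result)
```

Both aggregations use the built-in min and max of the resampler, so no Python-level function is called per bin.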
This should work:
resampled_df = df[["T", "N"]].resample(interval).agg(lambda x: max(x, key = abs))
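One detail to keep in mind when comparing this lambda with extreme_agg: Python's max returns the first maximal item it encounters, so when a bin contains both +x and -x as its extremes the two functions can disagree on the sign. A small check, using a made-up three-element Series:

```python
import pandas as pd

def extreme_agg(array_like):
    # return the max or the min, whichever has the greater absolute value
    return max(array_like.max(), array_like.min(), key=abs)

tied = pd.Series([-2, 1, 2])       # hypothetical bin where |min| == |max|
first_hit = max(tied, key=abs)     # iterates in order, keeps the first value with the largest |x|
by_min_max = extreme_agg(tied)     # the max is passed first, so the positive value wins the tie
print(first_hit, by_min_max)       # -2 2
```

If the sign on ties matters for your data, pick the variant whose tie rule you want.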