What is more efficient than np.sum and numpy boolean operators?

I'm having some trouble getting my code to run quickly.

After running a line profiler on my code, I found that the following lines are the main cause of the slowdown:

import numpy as np
import datetime

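# timestamps, minTime and maxTime are defined elsewhere (see the example values below)
timestamps = np.array(timestamps)
mask = (minTime <= timestamps) & (timestamps <= maxTime)  # element-wise range test
count = np.sum(mask)  # number of timestamps inside [minTime, maxTime]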

timestamps starts out as a list of datetimes, and minTime is a single datetime.

Example values:

minTime = datetime.datetime(2020, 5, 21, 2, 27, 26)

timestamps = [datetime.datetime(2020, 5, 21, 2, 27, 26), datetime.datetime(2020, 5, 21, 2, 27, 26), 
 datetime.datetime(2020, 5, 21, 2, 27, 26), datetime.datetime(2020, 5, 21, 2, 30, 55),
 datetime.datetime(2020, 5, 21, 2, 30, 55), datetime.datetime(2020, 5, 21, 2, 30, 55),
 datetime.datetime(2020, 5, 21, 2, 34, 26), datetime.datetime(2020, 5, 21, 2, 34, 26),
 datetime.datetime(2020, 5, 21, 2, 34, 26), datetime.datetime(2020, 5, 21, 2, 39, 26),
 datetime.datetime(2020, 5, 21, 2, 39, 26), datetime.datetime(2020, 5, 21, 2, 39, 26)]

Is there a more efficient way to rewrite the code above?

Any suggestions would be greatly appreciated.

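It looks like numpy.datetime64 objects are very fast: roughly a 2x speedup over the standard library datetime in the timings below. Pandas struggles a bit here. If you use pandas timestamps as the index of a Series object and use the .loc accessor, it does a little better than what you see below, but not by much.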

from datetime import datetime

import numpy
import pandas


py_dts = numpy.array([
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 27, 26), 
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26)
])

min_pydt = datetime(2020, 5, 21, 2, 27, 26)
max_pydt = datetime(2020, 5, 21, 2, 39, 26)

min_npdt = numpy.datetime64(min_pydt)
max_npdt = numpy.datetime64(max_pydt)

min_pddt = pandas.Timestamp(min_pydt)
max_pddt = pandas.Timestamp(max_pydt)

np_64s = numpy.array([numpy.datetime64(d) for d in py_dts])
pd_tss = pandas.Series([pandas.Timestamp(d) for d in py_dts])


def counter(timestamps, mindt, maxdt):
    # Count the timestamps that fall inside the closed interval [mindt, maxdt].
    return ((mindt <= timestamps) & (timestamps <= maxdt)).sum()

In a Jupyter notebook I ran:

%%timeit
counter(py_dts, min_pydt, max_pydt)

17.4 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
counter(np_64s, min_npdt, max_npdt)

7.42 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
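
Applied to the snippet from the question, the datetime64 route might look roughly like the sketch below. The names min_time, max_time, ts64, lo and hi are mine, the upper bound is an assumed value (the question does not give one), and np.count_nonzero is just one alternative to np.sum for a boolean mask that I have not timed here. The point is that the list is converted to a datetime64 array once, rather than being compared as an object array.

```python
import datetime

import numpy as np

# A few of the sample values from the question.
timestamps = [
    datetime.datetime(2020, 5, 21, 2, 27, 26),
    datetime.datetime(2020, 5, 21, 2, 30, 55),
    datetime.datetime(2020, 5, 21, 2, 34, 26),
    datetime.datetime(2020, 5, 21, 2, 39, 26),
]
min_time = datetime.datetime(2020, 5, 21, 2, 27, 26)
max_time = datetime.datetime(2020, 5, 21, 2, 34, 26)  # assumed upper bound

# Convert once; "us" matches the microsecond resolution of datetime.datetime.
ts64 = np.array(timestamps, dtype="datetime64[us]")
lo = np.datetime64(min_time)
hi = np.datetime64(max_time)

# Same closed-interval test as before, now on a native datetime64 array.
mask = (lo <= ts64) & (ts64 <= hi)
count = int(np.count_nonzero(mask))
print(count)
```

If the timestamps can be kept as a datetime64 array from the start, the conversion cost disappears from the hot path entirely.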

%%timeit
counter(pd_tss, min_pddt, max_pddt)

531 µs ± 2.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
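
For reference, the .loc variant mentioned at the top could be sketched as follows. This is only an illustration: loc_counter and the placeholder Series values are made up for the example, and it assumes the index is sorted, which label slicing on an index with repeated timestamps needs. I have not included a timing for it.

```python
from datetime import datetime

import pandas

py_dts = [
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 27, 26),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 30, 55),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 34, 26),
    datetime(2020, 5, 21, 2, 39, 26),
    datetime(2020, 5, 21, 2, 39, 26),
]

# The timestamps become the index; the values are placeholders.
indexed = pandas.Series(1, index=pandas.DatetimeIndex(py_dts))


def loc_counter(series, mindt, maxdt):
    # Label slicing with .loc includes both endpoints.
    return len(series.loc[mindt:maxdt])


count = loc_counter(
    indexed,
    pandas.Timestamp(2020, 5, 21, 2, 30, 55),
    pandas.Timestamp(2020, 5, 21, 2, 34, 26),
)
print(count)
```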