Search for elements by timestamp in a sorted pandas dataframe

Question

I have a very large pandas dataframe/series with millions of elements. I need to find all elements whose timestamp is less than t0. So normally what I would do is:

selected_df = df[df.index < t0]

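This takes a long time. As I understand it, when pandas searches it goes through every element of the dataframe. However, I know my dataframe is sorted, so I could break out of the loop as soon as the timestamp is > t0. I assume pandas doesn't know the dataframe is sorted and checks every timestamp.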

I have already tried using pandas.Series instead, and it is still slow. I tried writing my own loop like this:

boundary = 0
ticks_time_list = df.index
# Advance until the first timestamp that is not less than t0 (or the end of the index)
while boundary < len(ticks_time_list) and ticks_time_list[boundary] < t0:
    boundary += 1
selected_df = df[:boundary]

This takes even longer than the pandas search. The only solution I can see at the moment is to use Cython. Any idea how to sort this out without involving C?

Answer 1

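(I'm not very familiar with Pandas, but this describes a very generic idea, so you should be able to apply it. Adjust the Pandas-specific functions if necessary.) You could try a more efficient search. At the moment you are using a linear search that goes through all the elements. Instead, try this: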

ticks_time_list = df.index
tsearch_min = 0
tsearch_max = len(ticks_time_list)  # exclusive upper bound of the search range
while tsearch_min < tsearch_max:
    # Midpoint of the current range (not half of its width)
    tsearch_middle = (tsearch_min + tsearch_max) // 2
    if ticks_time_list[tsearch_middle] < t0:
        tsearch_min = tsearch_middle + 1  # boundary lies to the right of the midpoint
    else:
        tsearch_max = tsearch_middle  # midpoint is >= t0, so the boundary is here or to the left
# tsearch_min == tsearch_max is the position of the first timestamp >= t0
selected_df = df[:tsearch_min]

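Instead of opening every element and looking at its timestamp, it finds the "boundary" by repeatedly halving the search space, so it needs only about log2(n) comparisons rather than n.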
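The halving idea above is an ordinary binary search, which Python's standard library already provides in the bisect module. Below is a minimal sketch on a plain list of datetime values. The data and the names timestamps, t0 and boundary are made up for illustration, and it assumes the list is sorted in ascending order.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Hypothetical sorted timestamps, one per minute starting at midnight.
start = datetime(2001, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(1000)]

t0 = datetime(2001, 1, 1, 5, 0)

# bisect_left returns the position of the first element >= t0,
# which equals the number of elements strictly less than t0.
boundary = bisect_left(timestamps, t0)
selected = timestamps[:boundary]

# Every selected timestamp is before t0, and the next one (if any) is not.
assert all(ts < t0 for ts in selected)
assert boundary == len(timestamps) or timestamps[boundary] >= t0
```

If the input is not sorted, bisect_left still returns a position, but that position is meaningless, so the sortedness assumption has to hold.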

Answer 2

It doesn't seem to take very long for me, even with a long frame:

>>> df = pd.DataFrame({"A": 2, "B": 3}, index=pd.date_range("2001-01-01", freq="1 min", periods=10**7))
>>> len(df)
10000000
>>> %timeit df[df.index < "2001-09-01"]
100 loops, best of 3: 18.5 ms per loop

But if we really want to squeeze out every drop of performance, we can use the searchsorted method after dropping down to numpy:

>>> %timeit df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))]
10000 loops, best of 3: 51.9 µs per loop
>>> df[df.index < "2001-09-01"].equals(df.iloc[:df.index.values.searchsorted(np.datetime64("2001-09-01"))])
True

That is roughly 350 times faster in this measurement (18.5 ms versus 51.9 µs).
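As a variation on the same approach, a pandas Index also has its own searchsorted method, so the lookup can be written without going through .values. The following is a small sketch, not a benchmark. The frame, the column name A and the value of t0 are hypothetical, and it assumes the index is sorted in ascending order.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one row per minute, index sorted ascending.
index = pd.date_range("2001-01-01", freq="min", periods=1000)
df = pd.DataFrame({"A": np.arange(1000)}, index=index)

t0 = pd.Timestamp("2001-01-01 05:00")

# searchsorted is only meaningful on a sorted index, so check that first.
assert df.index.is_monotonic_increasing

# side="left" gives the position of the first timestamp >= t0,
# so the slice keeps only rows whose timestamp is strictly less than t0.
pos = df.index.searchsorted(t0, side="left")
selected_df = df.iloc[:pos]

# Cross-check against the boolean-mask version from the question.
assert selected_df.equals(df[df.index < t0])
```

Using side="right" instead would also include rows whose timestamp equals t0, which corresponds to df.index <= t0.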
