Fastest way to filter results from one dataframe into another dataframe based on multiple conditions (including date range)

Question

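Goal of this post: high-performance filtering

I have searched quite a bit for this problem, but the posts I found either perform poorly on larger dataframes or do not address my exact problem.

Problem:

I have the following dataframes, where each customer uploads required documents (recorded in dataframe 1) and customers purchase products (recorded in dataframe 2).

In layman's terms: at the moment a customer purchases a product, we are trying to retrieve the latest status of a specific document that the customer should have uploaded. If the customer has not uploaded the document, the result should be None.

The following three filter conditions should be applied for each row of dataframe_2: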

df_1.user == df_2.user
df_1.type == x
df_1.date_1 <= df_2.date_2

# i.e. date_1 from dataframe_1 is the maximum date that is still <= date_2 in dataframe_2.

Once these conditions have been applied, we want to retrieve the document's status (or None if it does not exist) and create that column in dataframe_2.

Dataframe 1:

document_type	user	date_1	status
x	123	2021-01-01	approved
y	123	2021-01-01	approved
x	123	2022-02-03	declined

Dataframe 2:

id	user	date_2
1	123	2021-01-01
2	123	2021-01-01
3	123	2021-05-04
4	123	2022-02-05
5	456	2021-07-30

Result:

id	user	date_2	document_x_status
1	123	2021-01-01	Approved
2	123	2021-01-01	Approved
3	123	2021-05-04	Approved
4	123	2022-02-05	Declined
5	456	2021-07-30	None
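
To pin down the intended semantics, the three conditions can be read literally, one row of dataframe_2 at a time. The sketch below is only a slow reference baseline for checking a faster solution against, not a recommended approach. The sample frames are rebuilt from the tables above, and the helper name latest_status is hypothetical.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
df_1 = pd.DataFrame({
    "document_type": ["x", "y", "x"],
    "user": [123, 123, 123],
    "date_1": pd.to_datetime(["2021-01-01", "2021-01-01", "2022-02-03"]),
    "status": ["approved", "approved", "declined"],
})
df_2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "user": [123, 123, 123, 123, 456],
    "date_2": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-05-04", "2022-02-05", "2021-07-30"]
    ),
})


def latest_status(row, docs, doc_type="x"):
    """Hypothetical helper: status of the latest matching document, or None."""
    candidates = docs[
        (docs["user"] == row["user"])
        & (docs["document_type"] == doc_type)
        & (docs["date_1"] <= row["date_2"])
    ]
    if candidates.empty:
        return None
    # Among the matching documents, keep the one with the largest date_1.
    return candidates.loc[candidates["date_1"].idxmax(), "status"]


# One boolean filter over df_1 per row of df_2: easy to verify, slow on large data.
expected = df_2.copy()
expected["document_x_status"] = [
    latest_status(row, df_1) for _, row in df_2.iterrows()
]
print(expected)
```

Because every row of dataframe_2 triggers a full scan of dataframe_1, the cost grows with the product of the two row counts, which is why this style of filtering becomes a problem at volume.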

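I have tried many approaches, from multi-index filtering to converting the fields to arrays with to_numpy() and filtering that way.

All of them took a considerable amount of time, and this is only now becoming a problem because of the large data volume.

Any help is much appreciated.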

Answer 1

You can try pd.merge_asof with the dates as the index:

import pandas as pd

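# merge_asof requires both frames to be sorted on the merge key, so keep only
# the type 'x' documents and sort by the date index (the date columns are
# assumed to already be datetime64).
df1 = df1[df1['document_type'].eq('x')].set_index('date_1').sort_index()
df2 = df2.set_index('date_2').sort_index()

# For each row of df2, take the latest df1 row of the same user with date_1 <= date_2.
res = (
    pd.merge_asof(df2, df1, by='user', left_index=True, right_index=True, direction='backward')
    .drop(columns=['document_type'])
    .fillna('None')
    .reset_index()
)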

print(res)

      date_2  index  id  user    status
0 2021-01-01      0   1   123  approved
1 2021-01-01      1   2   123  approved
2 2021-05-04      2   3   123  approved
3 2021-07-30      4   5   456      None
4 2022-02-05      3   4   123  declined
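
As a self-contained variation on the same idea, the sketch below uses left_on/right_on instead of moving the dates into the index, and then restores the original id order. The sample frames are rebuilt from the question's tables, and the column name document_x_status is taken from the expected result. Unmatched rows are left as missing values rather than being filled with the string 'None'.

```python
import pandas as pd

# Sample data rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "document_type": ["x", "y", "x"],
    "user": [123, 123, 123],
    "date_1": pd.to_datetime(["2021-01-01", "2021-01-01", "2022-02-03"]),
    "status": ["approved", "approved", "declined"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "user": [123, 123, 123, 123, 456],
    "date_2": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-05-04", "2022-02-05", "2021-07-30"]
    ),
})

# Keep only type-x documents and the columns needed for the lookup,
# sorted on the key that merge_asof will match on.
docs_x = (
    df1.loc[df1["document_type"].eq("x"), ["user", "date_1", "status"]]
    .sort_values("date_1")
)

# direction="backward" picks, per user, the last date_1 that is <= date_2
# (exact matches are allowed by default).
matched = pd.merge_asof(
    df2.sort_values("date_2"),
    docs_x,
    left_on="date_2",
    right_on="date_1",
    by="user",
    direction="backward",
)

# Tidy up: drop the helper date, rename the status column, restore id order.
result = (
    matched.drop(columns="date_1")
    .rename(columns={"status": "document_x_status"})
    .sort_values("id")
    .reset_index(drop=True)
)
print(result)
```

Keeping date_2 as a regular column avoids the extra reset_index step, and sorting by id at the end returns the rows in the order of the original dataframe_2.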
