How to merge 2 pandas DataFrames based on multiple conditions faster

I have 2 DataFrames:

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G

How can I merge these 2 DataFrames under the following conditions:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB']==df2['RB']:
    merge the value of df1['Valindex0'] to df2

The output should be:

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None

I know one way to do it, but it is very slow:

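import numpy as np

conditions = []

for index, row in df1.iterrows():
    # one boolean mask over df2 per df1 row: same RB and IssueDate inside [BeginDate, EndDate]
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))

# pick, for each df2 row, the Valindex0 of the first df1 row whose mask is True
df2['Valindex0'] = np.select(conditions, df1['Valindex0'].tolist(), default=None)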

Is there a faster solution?

Try this:

df2 = df2.merge(df1, left_on='RB', right_on='RB', how='inner')
df2 = df2[(df2['BeginDate'] <= df2['IssueDate']) & (df2['IssueDate'] <= df2['EndDate'])]
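
Note that an inner merge followed by a filter keeps only the df2 rows that found a matching interval, while the expected output in the question keeps every df2 row. A minimal sketch of one way to keep all rows, assuming at most one df1 interval should win per df2 row (the names candidates, in_range and matches are made up for illustration):

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101],
    "gs": ["8", "G", "G", "G", "G"],
})

# reset_index() stores the original df2 row label in a column named "index",
# so the matches can be aligned back to df2 afterwards.
candidates = df2.reset_index().merge(df1, on="RB", how="inner")

# Keep only the pairs whose IssueDate lies inside [BeginDate, EndDate].
in_range = candidates["IssueDate"].between(candidates["BeginDate"], candidates["EndDate"])
matches = (candidates.loc[in_range]
           .drop_duplicates("index")      # first matching interval wins (assumption)
           .set_index("index")["Valindex0"])

# Index alignment leaves NaN for every df2 row without a matching interval.
result = df2.assign(Valindex0=matches)
print(result)
```

Because of the NaN values, Valindex0 comes back as a float column (47.0 rather than 47).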

You could try SQL, since this is more complicated to do in pandas:

import pandas as pd
import sqlite3

conn = sqlite3.connect(':memory:')

# write both DataFrames from the question into in-memory tables A and B
df1.to_sql('A', conn, index=False)
df2.to_sql('B', conn, index=False)

qry = '''
    select  
        B.RB, B.IssueDate, B.gs, A.Valindex0
    from
        B left join A on
        (B.IssueDate between A.BeginDate and A.EndDate and B.RB = A.RB)
    '''
df = pd.read_sql_query(qry, conn)

#    RB  IssueDate gs  Valindex0
# 0  L3   19990201  8        NaN
# 1  00   19820101  G       47.0
# 2  48   19820101  G        NaN
# 3  50   19820101  G        NaN
# 4  50   19820101  G        NaN
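
For comparison, a pandas-only route to the same left-join result is pd.merge_asof. This is a different technique from the SQL query above, sketched here under the assumption that the intervals within each RB do not overlap, so the interval with the latest BeginDate not after IssueDate is the only candidate. The names left, right and merged are illustrative.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101],
    "gs": ["8", "G", "G", "G", "G"],
})

# merge_asof requires both frames to be sorted by their "on" key.
left = df2.reset_index().sort_values("IssueDate", kind="stable")
right = df1.sort_values("BeginDate")

# For each df2 row, take the df1 row with the same RB and the
# latest BeginDate that is <= IssueDate.
merged = pd.merge_asof(left, right,
                       left_on="IssueDate", right_on="BeginDate",
                       by="RB", direction="backward")

# Discard candidates whose interval ended before IssueDate.
merged["Valindex0"] = merged["Valindex0"].where(merged["IssueDate"] <= merged["EndDate"])

# Restore the original df2 row order and drop the helper columns.
result = (merged.set_index("index")
                .sort_index()
                .rename_axis(None)
                .drop(columns=["BeginDate", "EndDate"]))
print(result)
```

If the intervals for one RB can overlap, this sketch only ever checks the most recently started interval, so the SQL join or a merge plus filter is the safer choice.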

Solution

Uses: comparison with pd.Series.between + method chaining with pd.DataFrame.pipe

You can try this.

Note: I have used a slightly more generic dataset (df1, df2) to see that it works for all RB values.

What does this solution do?

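  • Merges (inner join) the DataFrames df1 and df2 on the column RB.
  • Applies the helper function update_column through pandas' DataFrame.pipe.
    • This evaluates the condition BeginDate <= IssueDate <= EndDate
    • and assigns None to Valindex0 in every row where the condition evaluates to False.
    • If you inspect the output DataFrame at this point, you can verify that the logic is implemented correctly, because the columns BeginDate and EndDate are still available.
  • Drops the unneeded columns (BeginDate and EndDate) to get the final result.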

Code

import pandas as pd

def update_column(df: pd.DataFrame, target_column: str="Valindex0"):
    cond = df["IssueDate"].between(df["BeginDate"], df["EndDate"])
    df.loc[~cond, target_column] = None
    return df

# evaluate result
result = (df2
    .merge(df1, how='inner', on="RB")                ## merge dataframes on column "RB"
    .pipe(update_column, target_column="Valindex0")  ## using piping for custom logic
    .drop(columns=["BeginDate", "EndDate"])          ## drop unnecessary columns
)

## Output: result
#    RB  IssueDate gs  Valindex0
# 0  L3   19990201  8       51.0
# 1  L3   19990201  8       50.0
# 2  00   19820101  G        NaN
# 3  00   19820101  G        NaN
# 4  00   19820101  G        NaN
# 5  00   19820101  G        NaN
# 6  48   19820101  G       58.0
# 7  50   19870101  G       52.0
# 8  50   19820121  G        NaN

Output

This is the resulting DataFrame before dropping the columns BeginDate and EndDate.
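
A minimal sketch for reproducing that intermediate frame yourself, using the dummy data from this answer. It swaps the in-place df.loc assignment for Series.where, which leaves the merged frame unmodified; the names merged, in_range and intermediate are chosen for illustration only.

```python
import pandas as pd

# Dummy data from this answer, built directly instead of via read_csv.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00", "L3", "L3", "50", "48"],
    "BeginDate": [19000120, 19820110, 19850101, 20010701,
                  19850101, 19850111, 19850121, 19810204],
    "EndDate": [19801231, 19841229, 20010629, 99991230,
                20450630, 20010609, 20010619, 20010699],
    "Valindex0": [45, 47, 50, 39, 51, 50, 52, 58],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50"],
    "IssueDate": [19990201, 19820101, 19820101, 19870101, 19820121],
    "gs": ["8", "G", "G", "G", "G"],
})

# Pair every df2 row with every df1 row that shares its RB.
merged = df2.merge(df1, how="inner", on="RB")

# True where IssueDate falls inside [BeginDate, EndDate] (inclusive on both ends).
in_range = merged["IssueDate"].between(merged["BeginDate"], merged["EndDate"])

# Series.where keeps values where the mask is True and puts NaN elsewhere.
intermediate = merged.assign(Valindex0=merged["Valindex0"].where(in_range))

# BeginDate and EndDate are still present, so each row can be checked by eye.
print(intermediate[["RB", "IssueDate", "BeginDate", "EndDate", "Valindex0"]])
```

Rows where Valindex0 is NaN are exactly those whose IssueDate falls outside the printed BeginDate/EndDate pair.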

Dummy data

Load DataFrame df1

import pandas as pd
from io import StringIO

df1s = """
RB  BeginDate   EndDate    Valindex0
00  19000120    19801231    45
00  19820110    19841229    47
00  19850101    20010629    50
00  20010701    99991230    39
L3  19850101    20450630    51
L3  19850111    20010609    50
50  19850121    20010619    52
48  19810204    20010699    58
"""

df1 = pd.read_csv(StringIO(df1s.strip()), sep=r'\s+',
                  dtype={"RB": str, "BeginDate": int, "EndDate": int})

Load DataFrame df2

import pandas as pd
from io import StringIO

df2s = """
RB  IssueDate   gs
L3  19990201    8
00  19820101    G
48  19820101    G
50  19870101    G
50  19820121    G
"""

df2 = pd.read_csv(StringIO(df2s.strip()), sep=r'\s+',
                  dtype={"RB": str, "IssueDate": int})