How to merge 2 pandas DataFrames based on multiple conditions faster
I have 2 dataframes:
df1:
RB BeginDate EndDate Valindex0
0 00 19000100 19811231 45
1 00 19820100 19841299 47
2 00 19850100 20010699 50
3 00 20010700 99999999 39
df2:
RB IssueDate gs
0 L3 19990201 8
1 00 19820101 G
2 48 19820101 G
3 50 19820101 G
4 50 19820101 G
How can I merge these 2 dataframes under the following condition:
if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB']==df2['RB']:
    merge the value of df1['Valindex0'] to df2
The output should be:
df2:
RB IssueDate gs Valindex0
0 L3 19990201 8 None
1 00 19820101 G 47 # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2 48 19820101 G None
3 50 19820101 G None
4 50 19820101 G None
I know one way to do it, but it is very slow:
import numpy as np
conditions = []
# Build one boolean mask over df2 per row of df1
for index, row in df1.iterrows():
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))
df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)
Is there a faster solution?
Try this:
df2 = df2.merge(df1, left_on='RB', right_on='RB', how='inner')
df2 = df2[(df2['BeginDate'] <= df2['IssueDate']) & (df2['IssueDate'] <= df2['EndDate'])]
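Note that an inner join followed by a filter drops every df2 row that has no matching range, whereas the question's expected output keeps those rows with an empty Valindex0. The following is a minimal sketch of a variant that keeps all df2 rows. It assumes the ranges for a given RB do not overlap (otherwise only the first match is kept), and `row_id`, `pairs`, `in_range`, and `matched` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Data from the question.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101],
    "gs": ["8", "G", "G", "G", "G"],
})

# Pair every df2 row with the candidate df1 ranges for the same RB,
# remembering which df2 row each pair came from ("row_id" is a hypothetical name).
pairs = df2.rename_axis("row_id").reset_index().merge(df1, on="RB", how="left")

# Keep only the pairs whose IssueDate falls inside [BeginDate, EndDate].
in_range = pairs["IssueDate"].between(pairs["BeginDate"], pairs["EndDate"])
matched = (
    pairs.loc[in_range]
    .drop_duplicates("row_id")      # assumption: keep the first match per df2 row
    .set_index("row_id")["Valindex0"]
)

# Align back onto df2 by index; rows without a match become NaN.
result = df2.assign(Valindex0=matched)
print(result)
```

Because the left merge introduces missing values, Valindex0 comes back as a float column, much like in the SQL-based answer.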
You can try using SQL, since this is more complicated in pandas:
import pandas as pd
import sqlite3
conn = sqlite3.connect(':memory:')
# Write both dataframes to an in-memory SQLite database
df1.to_sql('A', conn, index=False)
df2.to_sql('B', conn, index=False)
qry = '''
    select
        B.RB, B.IssueDate, B.gs, A.Valindex0
    from
        B left join A on
        (B.IssueDate between A.BeginDate and A.EndDate and B.RB = A.RB)
    '''
df = pd.read_sql_query(qry, conn)
# RB IssueDate gs Valindex0
# 0 L3 19990201 8 NaN
# 1 00 19820101 G 47.0
# 2 48 19820101 G NaN
# 3 50 19820101 G NaN
# 4 50 19820101 G NaN
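The join condition in the query above (a range test plus an equality test) can also be sketched without a database, using NumPy broadcasting. This is a different technique from the SQL join, shown only as an illustration. It assumes that at most one df1 range should match each df2 row, and that a boolean matrix of shape len(df2) x len(df1) fits in memory. The names `issue`, `rb`, `match`, and `first` are hypothetical.

```python
import numpy as np
import pandas as pd

# Data from the question.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00"],
    "BeginDate": [19000100, 19820100, 19850100, 20010700],
    "EndDate": [19811231, 19841299, 20010699, 99999999],
    "Valindex0": [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50"],
    "IssueDate": [19990201, 19820101, 19820101, 19820101, 19820101],
    "gs": ["8", "G", "G", "G", "G"],
})

# Column vectors of shape (len(df2), 1) so they broadcast against df1's rows.
issue = df2["IssueDate"].to_numpy()[:, None]
rb = df2["RB"].to_numpy()[:, None]

# match[i, j] is True when df2 row i satisfies the join condition against df1 row j.
match = (
    (issue >= df1["BeginDate"].to_numpy())
    & (issue <= df1["EndDate"].to_numpy())
    & (rb == df1["RB"].to_numpy())
)

# Index of the first matching df1 row per df2 row (0 when nothing matches,
# which is masked out below with match.any).
first = match.argmax(axis=1)
values = df1["Valindex0"].to_numpy(dtype=float)[first]
result = df2.assign(Valindex0=np.where(match.any(axis=1), values, np.nan))
print(result)
```

The trade-off is memory: the comparison avoids a Python-level loop, but the intermediate matrix grows with the product of the two row counts.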
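Solution
Uses: comparison with pd.Series.between + method chaining with pd.DataFrame.pipe
You can try this.
Note: I have used a slightly more generic dataset (df1, df2) to show that it works for all RB values.
What does this solution give you?
- Merges (inner join) the dataframes df1 and df2.
- Uses pandas DataFrame.pipe with the convenience function update_column:
  - This evaluates the condition BeginDate <= IssueDate <= EndDate
  - and assigns None to any row where the condition evaluates to False.
  - If you inspect the output dataframe at this point, you can verify that the logic is implemented correctly, since the columns BeginDate and EndDate are still available.
- Drops the unnecessary columns (BeginDate and EndDate) to get the final result.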
Code
import pandas as pd
def update_column(df: pd.DataFrame, target_column: str="Valindex0"):
    # True where IssueDate lies within [BeginDate, EndDate]
    cond = df["IssueDate"].between(df["BeginDate"], df["EndDate"])
    df.loc[~cond, target_column] = None
    return df
# evaluate result
result = (df2
    .merge(df1, how='inner', on="RB") ## merge dataframes on column "RB"
    .pipe(update_column, target_column="Valindex0") ## using piping for custom logic
    .drop(columns=["BeginDate", "EndDate"]) ## drop unnecessary columns
)
## Output: result
# RB IssueDate gs Valindex0
# 0 L3 19990201 8 51.0
# 1 L3 19990201 8 50.0
# 2 00 19820101 G NaN
# 3 00 19820101 G NaN
# 4 00 19820101 G NaN
# 5 00 19820101 G NaN
# 6 48 19820101 G 58.0
# 7 50 19870101 G 52.0
# 8 50 19820121 G NaN
Output
This is the output of the resulting dataframe, before dropping the columns BeginDate and EndDate.
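The intermediate dataframe described here can be reproduced by stopping before the drop step. The sketch below builds the answer's dummy data directly with pd.DataFrame instead of read_csv, and flags each row in a hypothetical `in_range` column rather than overwriting Valindex0, so the check stays visible next to BeginDate and EndDate.

```python
import pandas as pd

# Dummy data from this answer, built directly instead of via read_csv.
df1 = pd.DataFrame({
    "RB": ["00", "00", "00", "00", "L3", "L3", "50", "48"],
    "BeginDate": [19000120, 19820110, 19850101, 20010701,
                  19850101, 19850111, 19850121, 19810204],
    "EndDate": [19801231, 19841229, 20010629, 99991230,
                20450630, 20010609, 20010619, 20010699],
    "Valindex0": [45, 47, 50, 39, 51, 50, 52, 58],
})
df2 = pd.DataFrame({
    "RB": ["L3", "00", "48", "50", "50"],
    "IssueDate": [19990201, 19820101, 19820101, 19870101, 19820121],
    "gs": ["8", "G", "G", "G", "G"],
})

# Same inner join as in the answer, without the update and drop steps.
merged = df2.merge(df1, how="inner", on="RB")

# "in_range" is a hypothetical helper column: True where
# BeginDate <= IssueDate <= EndDate holds for that pair.
merged["in_range"] = merged["IssueDate"].between(merged["BeginDate"], merged["EndDate"])
print(merged)
```

Rows where `in_range` is False are the ones whose Valindex0 the answer's update_column sets to None.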
Dummy data
Load dataframe df1.
import pandas as pd
from io import StringIO
df1s = """
RB BeginDate EndDate Valindex0
00 19000120 19801231 45
00 19820110 19841229 47
00 19850101 20010629 50
00 20010701 99991230 39
L3 19850101 20450630 51
L3 19850111 20010609 50
50 19850121 20010619 52
48 19810204 20010699 58
"""
# Raw string avoids an invalid-escape warning for the regex separator
df1 = pd.read_csv(StringIO(df1s.strip()), sep=r'\s+',
                  dtype={"RB": str, "BeginDate": int, "EndDate": int})
Load dataframe df2.
import pandas as pd
from io import StringIO
df2s = """
RB IssueDate gs
L3 19990201 8
00 19820101 G
48 19820101 G
50 19870101 G
50 19820121 G
"""
# Raw string avoids an invalid-escape warning for the regex separator
df2 = pd.read_csv(StringIO(df2s.strip()), sep=r'\s+',
                  dtype={"RB": str, "IssueDate": int})