Python: Join two dataframes based on hour and nearest minute of date index

I have two dataframes with different dates, as follows:

df1 = pd.DataFrame(index=['2022-01-01 00:37:57', '2022-01-01 03:49:12', '2022-01-01 09:30:11'], columns = ['price'])
df1['price'] = [10,13,12]
df1.index = df1.index.rename('date')
df1:
                        price
date                      
2022-01-01 00:37:57     10
2022-01-01 03:49:12     13
2022-01-01 09:30:11     12 

df2 = pd.DataFrame(index=['2022-01-01 00:35:00', '2022-01-01 00:47:00', '2022-01-01 00:56:12', '2022-01-01 03:45:00', '2022-01-01 03:50:32',
                        '2022-01-01 09:29:20', '2022-01-01 09:31:21'], columns=['price'])
df2['price'] = [3000,3210, 2999, 3001, 3027, 3021, 3002]
df2.index = df2.index.rename('date')
df2:
                      price
date                      
2022-01-01 00:35:00   3000
2022-01-01 00:47:00   3210
2022-01-01 00:56:12   2999
2022-01-01 03:45:00   3001
2022-01-01 03:50:32   3027
2022-01-01 09:29:20   3021
2022-01-01 09:31:21   3002

I want to left join df1 and df2, along the lines of df1.join(df2, how='left'), but matching on the hour and the nearest minute, to get the following:

df:
                        price_x price_y
date
2022-01-01 00:37:57     10      3000
2022-01-01 03:49:12     13      3027
2022-01-01 09:30:11     12      3021

For example, the last row is joined on the date "2022-01-01 09:29:20" because that is the closest to "2022-01-01 09:30:11".

How can I do this?

Try pd.merge_asof() (this assumes both indexes are of datetime type and sorted):

print(
    pd.merge_asof(
        df1,
        df2,
        left_index=True,
        right_index=True,
        direction="nearest",
    )
)

Prints:

                     price_x  price_y
date                                 
2022-01-01 00:37:57       10     3000
2022-01-01 03:49:12       13     3027
2022-01-01 09:30:11       12     3021
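
The frames in the question are built with plain string labels, so the datetime assumption above does not hold until the index is converted. Here is a minimal, self-contained sketch of that setup. The `tolerance` of 30 minutes is my own assumption and not part of the question; it only shows one way to stop a row from matching a timestamp that is far away.

```python
import pandas as pd

# Rebuild the frames from the question, parsing the string labels into a DatetimeIndex
df1 = pd.DataFrame(
    {"price": [10, 13, 12]},
    index=pd.to_datetime(
        ["2022-01-01 00:37:57", "2022-01-01 03:49:12", "2022-01-01 09:30:11"]
    ),
).rename_axis("date")

df2 = pd.DataFrame(
    {"price": [3000, 3210, 2999, 3001, 3027, 3021, 3002]},
    index=pd.to_datetime(
        [
            "2022-01-01 00:35:00", "2022-01-01 00:47:00", "2022-01-01 00:56:12",
            "2022-01-01 03:45:00", "2022-01-01 03:50:32",
            "2022-01-01 09:29:20", "2022-01-01 09:31:21",
        ]
    ),
).rename_axis("date")

# merge_asof requires both join keys to be sorted
df1 = df1.sort_index()
df2 = df2.sort_index()

nearest = pd.merge_asof(
    df1,
    df2,
    left_index=True,
    right_index=True,
    direction="nearest",
    tolerance=pd.Timedelta("30min"),  # assumed cutoff, not from the question
)
print(nearest)
```

Rows of df1 with no df2 timestamp inside the tolerance would get NaN in price_y, which is the usual left-join behaviour.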

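Anrej Kesely gave a great answer, and I suspect pandas is more efficient than my own approach. I don't have enough reputation to comment and ask you to clarify the question. However, if you are looking for the nearest date in df2 that occurs before the date in df1, this code will work.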

import pandas as pd
import numpy as np
from datetime import datetime
df1 = pd.DataFrame(index=['2022-01-01 00:37:57', '2022-01-01 03:49:12', '2022-01-01 09:30:11'], columns = ['price'])
df1['price'] = [10,13,12]
df1.index = df1.index.rename('date')
df1 = df1.reset_index()
df2 = pd.DataFrame(index=['2022-01-01 00:35:00', '2022-01-01 00:47:00', '2022-01-01 00:56:12', '2022-01-01 03:45:00', '2022-01-01 03:50:32',
                        '2022-01-01 09:29:20', '2022-01-01 09:31:21'], columns=['price'])
df2['price'] = [3000,3210, 2999, 3001, 3027, 3021, 3002]
df2.index = df2.index.rename('date')
df2 = df2.reset_index()
print(df1)

def min_diff(date, df):
    # Return the position of the latest row in df that is strictly before `date`, or -1 if there is none
    best_diff = float("-inf")  # closest negative difference seen so far, in seconds
    best_index = -1
    target = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
    for i in range(len(df)):
        candidate = datetime.strptime(df['date'][i], "%Y-%m-%d %H:%M:%S")
        difference = (candidate - target).total_seconds()
        # Keep only rows before the target, preferring the one closest to it
        if best_diff < difference < 0:
            best_diff = difference
            best_index = i
    return best_index

print(df2.loc[min_diff(df1['date'][0], df2)])
df1['Price from 2'] = ''
for i in range(len(df1)):
    df1.loc[i, 'Price from 2'] = df2.loc[min_diff(df1['date'][i], df2), 'price']
print(df1)

This displays the following:

                  date  price Price from 2
0  2022-01-01 00:37:57     10         3000
1  2022-01-01 03:49:12     13         3001
2  2022-01-01 09:30:11     12         3021
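
For what it's worth, the same "previous date only" lookup can probably be done without the loop by passing direction="backward" to pd.merge_asof(). This is only a sketch under a few assumptions of mine: the dates are parsed with pd.to_datetime, the names `left`, `right` and `previous` and the `_from_2` suffix are made up for illustration, and allow_exact_matches=False is used to mirror the strict `difference < 0` test.

```python
import pandas as pd

# Same data as above, with 'date' as a regular datetime column
left = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01 00:37:57", "2022-01-01 03:49:12", "2022-01-01 09:30:11"]
        ),
        "price": [10, 13, 12],
    }
)
right = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2022-01-01 00:35:00", "2022-01-01 00:47:00", "2022-01-01 00:56:12",
                "2022-01-01 03:45:00", "2022-01-01 03:50:32",
                "2022-01-01 09:29:20", "2022-01-01 09:31:21",
            ]
        ),
        "price": [3000, 3210, 2999, 3001, 3027, 3021, 3002],
    }
)

previous = pd.merge_asof(
    left.sort_values("date"),
    right.sort_values("date"),
    on="date",
    direction="backward",        # only look at rows at or before each left date
    allow_exact_matches=False,   # strictly before, like `difference < 0`
    suffixes=("", "_from_2"),    # hypothetical suffix for the df2 price column
)
print(previous)
```

Here the left-hand 'date' values are kept in the result, and a left row with no earlier match would get NaN instead of raising an error.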

If you are just looking for the nearest date and don't care about direction, @Anrej Kesely gave a great answer. Hope one of us helped!