如何在给定特定的每日时间过滤数据框?

How to filter a dataframe given a specific daily hour?

给定两个数据框:

df1:

datetime              v
2020-10-01 12:00:00   15
2020-10-02            4
2020-10-03 07:00:00   3
2020-10-03 08:01:00   51
2020-10-03 09:00:00   9

df2:

datetime              p
2020-10-01 11:00:00   1
2020-10-01 12:00:00   2
2020-10-02 13:00:00   14
2020-10-02 13:01:00   5
2020-10-03 20:00:00   12
2020-10-03 02:01:00   30
2020-10-03 07:00:00   7

我想将这两个数据帧合并为一个,策略是每天查找 08:00 附近最接近的值。最后的结果应该是

datetime              v       p
2020-10-01 08:00:00   15      1
2020-10-02 08:00:00   4       14
2020-10-03 08:00:00   51      7

我该如何实现?

给定以下数据帧:

import pandas as pd

df1 = pd.DataFrame(
    {
        "datetime": [
            "2020-10-01 12:00:00",
            "2020-10-02",
            "2020-10-03 07:00:00",
            "2020-10-03 08:01:00",
            "2020-10-03 09:00:00",
        ],
        "v": [15, 4, 3, 51, 9],
    }
)

df2 = pd.DataFrame(
    {
        "datetime": [
            "2020-10-01 11:00:00",
            "2020-10-01 12:00:00",
            "2020-10-02 13:00:00",
            "2020-10-02 13:01:00",
            "2020-10-03 20:00:00",
            "2020-10-03 02:01:00",
            "2020-10-03 07:00:00",
        ],
        "p": [1, 2, 14, 5, 12, 30, 7],
    }
)

您可以定义一个辅助函数:

def align(df):
    # Set proper type
    df["datetime"] = pd.to_datetime(df["datetime"])

    # Slice df by day
    dfs = [
        df.copy().loc[df["datetime"].dt.date == item, :]
        for item in df["datetime"].dt.date.unique()
    ]

    # Evaluate distance in seconds between given hour and 08:00:00 and filter on min
    for i, df in enumerate(dfs):
        df["target"] = pd.to_datetime(df["datetime"].dt.date.astype(str) + " 08:00:00")

        df["distance"] = (
            df["target"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
            - df["datetime"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
        ).abs()

        dfs[i] = df.loc[df["distance"].idxmin(), :]

    # Concatenate filtered dataframes
    df = (
        pd.concat(dfs, axis=1)
        .T.drop(columns=["datetime", "distance"])
        .rename(columns={"target": "datetime"})
        .set_index("datetime")
    )
    return df

申请df1和df2然后合并:

df = pd.merge(
    right=align(df1), left=align(df2), how="outer", right_index=True, left_index=True
).reindex(columns=["v", "p"])
print(df)
# Output
                      v   p
datetime
2020-10-01 08:00:00  15   1
2020-10-02 08:00:00   4  14
2020-10-03 08:00:00  51   7