Pandas merge multiple dataframes on one temporal index, with latest value from all others

I'm merging several DataFrames that each have a time index.

import pandas as pd
df1 = pd.DataFrame(['a', 'b', 'c'],
    columns=pd.MultiIndex.from_product([['target'], ['key']]),
    index = [
        '2022-04-15 20:20:20.000000', 
        '2022-04-15 20:20:21.000000', 
        '2022-04-15 20:20:22.000000'],)
df2 = pd.DataFrame(['a2', 'b2', 'c2', 'd2', 'e2'],
    columns=pd.MultiIndex.from_product([['feature2'], ['keys']]),
    index = [
        '2022-04-15 20:20:20.100000', 
        '2022-04-15 20:20:20.500000', 
        '2022-04-15 20:20:20.900000', 
        '2022-04-15 20:20:21.000000', 
        '2022-04-15 20:20:21.100000',],)
df3 = pd.DataFrame(['a3', 'b3', 'c3', 'd3', 'e3'],
    columns=pd.MultiIndex.from_product([['feature3'], ['keys']]),
    index = [
        '2022-04-15 20:20:19.000000', 
        '2022-04-15 20:20:19.200000', 
        '2022-04-15 20:20:20.000000', 
        '2022-04-15 20:20:20.200000', 
        '2022-04-15 20:20:23.100000',],)

Then I use this merge routine:

def merge(dfs:list[pd.DataFrame], targetColumn:'str|tuple[str]'):
    from functools import reduce
    if len(dfs) == 0:
        return None
    if len(dfs) == 1:
        return dfs[0]
    for df in dfs:
        df.index = pd.to_datetime(df.index)
    merged = reduce(
        lambda left, right: pd.merge(
            left, 
            right, 
            how='outer',
            left_index=True,
            right_index=True),
        dfs)
    for col in merged.columns:
        if col != targetColumn:
            merged[col] = merged[col].ffill()  # fillna(method='ffill') is deprecated in recent pandas
    return merged[merged[targetColumn].notna()]

I call it like this:

merged = merge([df1, df2, df3], targetColumn=('target', 'key'))

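This produces the result I expect: one row per target timestamp, with each feature column carrying its latest value at or before that timestamp.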

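So far so good. The problem is efficiency: in merge() I use reduce with an outer merge to join the DataFrames, which can produce a huge intermediate DataFrame that is only filtered down afterwards. What if my machine doesn't have enough RAM to hold that intermediate DataFrame? That is the problem I'm trying to avoid.

Is there a way to merge without expanding the data into one huge DataFrame first?

A regular merge is not enough, because it only matches index values that are exactly equal, not the latest time index at or before each target observation: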

df1.merge(df2, how='left', left_index=True, right_index=True)

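Has this already been solved efficiently? It seems like a common data science problem, since nobody wants to leak future information into their models, and everybody has various inputs to merge together...

You're in luck: pandas.merge_asof does exactly what you need!

We rely on the default direction='backward' argument, which the documentation describes as: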

A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
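
To make the quoted rule concrete, here is a small sketch on toy data. The frames `left` and `right` and their values are made up for illustration and are not from the question. It also shows `allow_exact_matches=False`, which may be worth considering if a feature stamped at exactly the target time should count as "future" information.

```python
import pandas as pd

# Hypothetical toy data: three target rows and four feature observations.
left = pd.DataFrame(
    {"target": ["a", "b", "c"]},
    index=pd.to_datetime([
        "2022-04-15 20:20:20.000",
        "2022-04-15 20:20:21.000",
        "2022-04-15 20:20:22.000",
    ]),
)
right = pd.DataFrame(
    {"feature": ["w", "x", "y", "z"]},
    index=pd.to_datetime([
        "2022-04-15 20:20:19.500",
        "2022-04-15 20:20:20.500",
        "2022-04-15 20:20:21.000",
        "2022-04-15 20:20:22.500",
    ]),
)

# Default: direction='backward', exact matches allowed.
backward = pd.merge_asof(left, right, left_index=True, right_index=True)
print(backward["feature"].tolist())  # ['w', 'y', 'y']

# Strictly earlier rows only: the exact match at 20:20:21 is skipped.
strict = pd.merge_asof(
    left, right, left_index=True, right_index=True, allow_exact_matches=False
)
print(strict["feature"].tolist())  # ['w', 'x', 'y']
```

Note that the row `z` at 20:20:22.500 is never picked, since it lies after every target timestamp.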

Using your three example DataFrames:

import pandas as pd
from functools import reduce

# Convert all indexes to datetime
for df in [df1, df2, df3]:
    df.index = pd.to_datetime(df.index)

# Perform as-of merges
res = reduce(lambda left, right:
             pd.merge_asof(left, right, left_index=True, right_index=True),
             [df1, df2, df3])

print(res)

                    target feature2 feature3
                       key     keys     keys
2022-04-15 20:20:20      a      NaN       c3
2022-04-15 20:20:21      b       d2       d3
2022-04-15 20:20:22      c       e2       d3
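
If you want this as a drop-in replacement for the merge() function in the question, something along these lines might work. This is only a sketch: `merge_latest` and its `tolerance` argument are names I made up, the example frames use flat column names rather than the MultiIndex columns from the question, and it copies each input instead of rewriting its index in place. Every intermediate result has exactly as many rows as the target frame, so the large outer-joined frame is never built.

```python
from functools import reduce

import pandas as pd


def merge_latest(target, features, tolerance=None):
    """As-of merge every frame in `features` onto `target` (hypothetical helper)."""

    def prepare(df):
        # Work on a copy so the caller's frames keep their original index.
        out = df.copy()
        out.index = pd.to_datetime(out.index)
        return out.sort_index()  # merge_asof requires sorted keys

    return reduce(
        lambda left, right: pd.merge_asof(
            left,
            prepare(right),
            left_index=True,
            right_index=True,
            direction="backward",
            tolerance=tolerance,  # optional pd.Timedelta limiting how stale a match may be
        ),
        features,
        prepare(target),
    )


# Same timestamps and values as the question, but with flat column names.
target = pd.DataFrame(
    {"target": ["a", "b", "c"]},
    index=["2022-04-15 20:20:20.000000",
           "2022-04-15 20:20:21.000000",
           "2022-04-15 20:20:22.000000"],
)
f2 = pd.DataFrame(
    {"feature2": ["a2", "b2", "c2", "d2", "e2"]},
    index=["2022-04-15 20:20:20.100000",
           "2022-04-15 20:20:20.500000",
           "2022-04-15 20:20:20.900000",
           "2022-04-15 20:20:21.000000",
           "2022-04-15 20:20:21.100000"],
)
f3 = pd.DataFrame(
    {"feature3": ["a3", "b3", "c3", "d3", "e3"]},
    index=["2022-04-15 20:20:19.000000",
           "2022-04-15 20:20:19.200000",
           "2022-04-15 20:20:20.000000",
           "2022-04-15 20:20:20.200000",
           "2022-04-15 20:20:23.100000"],
)

res = merge_latest(target, [f2, f3])
print(res)

# Ignore feature values older than 500 ms.
fresh = merge_latest(target, [f2, f3], tolerance=pd.Timedelta("500ms"))
print(fresh)
```

The `tolerance` variant is optional; it just leaves NaN wherever the latest available value is older than the given Timedelta.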

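Here is some code that works for your example. I'm not sure about the more general case of multi-indexed columns, but it shows the basic idea of merging on a single time index.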

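# Start from the target frame; everything else is aligned onto its index.
merged = df1.copy(deep=True)
for df in [df2, df3]:
    # For each target timestamp: position of the last row in df at or before it (-1 if none).
    idxNew = df.index.get_indexer(merged.index, method='pad')
    # Target rows that found a match, and the positions of the matching rows in df.
    idxMerged = [i for i, x in enumerate(idxNew) if x != -1]
    idxNew = [x for x in idxNew if x != -1]
    n = len(merged.columns)
    # Add df's columns (empty for now), then fill in the matched rows.
    merged[df.columns] = None
    merged.iloc[idxMerged,n:] = df.iloc[idxNew,:].set_index(merged.index[idxMerged])
print(merged)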

Output:

                           target feature2 feature3
                              key     keys     keys
2022-04-15 20:20:20.000000      a     None       c3
2022-04-15 20:20:21.000000      b       d2       d3
2022-04-15 20:20:22.000000      c       e2       d3
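
The get_indexer(..., method='pad') lookup can probably be written more compactly with DataFrame.reindex(method='ffill'), which performs the same "last row at or before" alignment in one call. The sketch below assumes the feature index is unique and sorted (reindex with a fill method requires a monotonic index), and it uses flat column names and only one feature frame to keep things short.

```python
import pandas as pd

target = pd.DataFrame(
    {"target": ["a", "b", "c"]},
    index=pd.to_datetime([
        "2022-04-15 20:20:20.000000",
        "2022-04-15 20:20:21.000000",
        "2022-04-15 20:20:22.000000",
    ]),
)
feature2 = pd.DataFrame(
    {"feature2": ["a2", "b2", "c2", "d2", "e2"]},
    index=pd.to_datetime([
        "2022-04-15 20:20:20.100000",
        "2022-04-15 20:20:20.500000",
        "2022-04-15 20:20:20.900000",
        "2022-04-15 20:20:21.000000",
        "2022-04-15 20:20:21.100000",
    ]),
)

# For each target timestamp, take the last feature row at or before it.
# Target rows with no earlier feature row come back as NaN.
aligned = feature2.reindex(target.index, method="ffill")

# Both frames now share the target index, so a plain column-wise concat is enough.
merged = pd.concat([target, aligned], axis=1)
print(merged)
```

As with merge_asof, the result never has more rows than the target frame, so nothing larger than the inputs is materialized.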