Most efficient way to update dataframe with fresh data

Question

I have an "archive" dataframe with hundreds of columns, each representing a time series (S1, S2, ...):

        S1  S2
Date1   5   5
Date2   8   10

I need to update the archive with new data imported from several dataframes (for each new date I have more than one "new_data" dataframe). For example:

new_data1:

        S3
Date3   8

new_data2:

        S2  S4
Date3   9   5

new_data3:

        S3
Date4   5

new_data4:

        S4
Date4   9

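So each new_data dataframe shares some columns with the archive dataframe, but may also contain new columns. This is what the archive dataframe should look like after the update: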

        S1  S2  S3  S4
Date1   5   5   NaN NaN
Date2   8   10  NaN NaN
Date3   NaN 9   8   5
Date4   NaN NaN 5   9

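I have seen that I can do an outer merge of the archive dataframe with each new_data dataframe, and then combine the duplicate columns (_x and _y) that the merge creates: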

dataframes = [new_data1, new_data2, new_data3, new_data4]

for i in dataframes:
    # Merge the dataframe
    archive = archive.merge(i, how='outer', on='Date')

    # Get the series names
    series_names = i.columns

    # Combine duplicate columns
    for series_name in series_names:
        if series_name + "_x" in archive.columns:
            x = series_name + "_x"
            y = series_name + "_y"
            # Prefer the new value (_y); fall back to the archived one (_x)
            archive[series_name] = archive[y].fillna(archive[x])
            archive.drop(columns=[x, y], inplace=True)

I would like to know whether there is a more efficient way to do the same thing. Thanks.

Answer 1

What you describe sounds like an "upsert" in SQL systems. The pandas equivalent is combine_first:

for i in dataframes:
    archive = i.combine_first(archive)
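
To make this concrete, here is a minimal sketch that runs the loop on the sample data from the question. It assumes the dates are the index of every dataframe: combine_first aligns on index and column labels, so if Date is an ordinary column it would first need set_index('Date'). The string labels Date1 to Date4 are placeholders standing in for real dates.

```python
import pandas as pd

# Sample data from the question; the dates are assumed to be the index.
archive = pd.DataFrame({"S1": [5, 8], "S2": [5, 10]}, index=["Date1", "Date2"])
new_data1 = pd.DataFrame({"S3": [8]}, index=["Date3"])
new_data2 = pd.DataFrame({"S2": [9], "S4": [5]}, index=["Date3"])
new_data3 = pd.DataFrame({"S3": [5]}, index=["Date4"])
new_data4 = pd.DataFrame({"S4": [9]}, index=["Date4"])

dataframes = [new_data1, new_data2, new_data3, new_data4]

for new_data in dataframes:
    # Values in new_data take priority; gaps are filled from archive.
    # The result covers the union of both frames' rows and columns.
    archive = new_data.combine_first(archive)

print(archive)
```

Because the calling frame wins wherever both frames hold a non-null value, new_data.combine_first(archive) lets fresh data overwrite archived values; swapping the two would keep the archived values instead. Cells that no frame supplies (for example S3 on Date2) come out as NaN.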
