How to upsample a multi-index dataframe ensuring each grouping covers the same time range (provide custom starting and ending datetimes)

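Here is a dummy example to illustrate the problem.

I am interested in upsampling to the start of each year (AS), and for every country I want to cover the period from 1995 to 2000.

Suppose we have the following dataset: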

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [
        '1995-01-01', '1997-01-01', 
        '1997-01-01', '1998-01-01', '2000-01-01',
        '1996-01-01', '1999-01-01', 
    ],
    'country': [
        'ES', 'ES', 
        'GB', 'GB', 'GB',
        'DE', 'DE',
    ],
    'members': [
        100, 101, 
        200, 201, 202,
        300, 301,
    ]
})
df['year']= pd.to_datetime(df['year'])
df = df.set_index(['country', 'year'])
print(df)
                    members
country year               
ES      1995-01-01      100
        1997-01-01      101

GB      1997-01-01      200
        1998-01-01      201
        2000-01-01      202

DE      1996-01-01      300
        1999-01-01      301

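As you can see, no country has data for every year between 1995 and 2000. Note that some countries are missing 1995, while others are missing 2000.

I know how to upsample the dataframe so that the missing years in between are filled in for each country (for example, adding 1996 to ES).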

def my_upsample(df):
    return (
        df
        .reset_index('country')               # upsampling a multi-index with the keyword level is not supported
        .groupby('country', group_keys=False) # hence this little trick of single-indexing & grouping
                                            # see this issue for details: https://github.com/pandas-dev/pandas/issues/28313
        .resample('AS') # resample to the beginning of each year
        .apply({
            'country':'pad',    # pad the countries
            'members':'asfreq', # but leave the number of members as NaN. Irrelevant in this dummy example, but 
                                # the desired behaviour in my real-world problem
        })
    )

print(my_upsample(df))
           country  members
year                       
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0

1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0

1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0

But what I want is to make sure that every country covers the period from 1995 to 2000.

The desired output should look like this:

           country  members
year                       
1995-01-01      DE      NaN
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
2000-01-01      DE      NaN

1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1998-01-01      ES      NaN
1999-01-01      ES      NaN
2000-01-01      ES      NaN

1995-01-01      GB      NaN
1996-01-01      GB      NaN
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0

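I can use a Python loop to go through each country and add the missing rows (see the code below), but I would like to know what the pandas way of achieving this is.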

for country in df.index.levels[0]:
    if not (country, '1995-01-01') in df.query(f"country == @country").index:
        # if this country doesn't have the year 1995 create the row with NaN as value
        df.loc[(country, '1995-01-01'),:] = np.nan 

    if not (country, '2000-01-01') in df.query(f"country == @country").index:
        # if this country doesn't have the year 2000 create the row with NaN as value
        df.loc[(country, '2000-01-01'),:] = np.nan
print(df.sort_index())
                    members
country year               
DE      1995-01-01      NaN
        1996-01-01    300.0
        1999-01-01    301.0
        2000-01-01      NaN
ES      1995-01-01    100.0
        1997-01-01    101.0
        2000-01-01      NaN
GB      1995-01-01      NaN
        1997-01-01    200.0
        1998-01-01    201.0
        2000-01-01    202.0

Running my_upsample then returns the desired output:

print(my_upsample(df.sort_index()))
           country  members
year                       
1995-01-01      DE      NaN
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
2000-01-01      DE      NaN

1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1998-01-01      ES      NaN
1999-01-01      ES      NaN
2000-01-01      ES      NaN

1995-01-01      GB      NaN
1996-01-01      GB      NaN
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0

I'm sure there is a better way, but here is one way to do it:

def my_upsample(df):
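    # Get all periods, from the earliest to the latest year present in the data
    years = df.index.get_level_values(1)
    # "AS" = year start, as in the question; newer pandas versions spell this alias "YS"
    years = pd.date_range(years.min(), years.max(), freq="AS")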

    # Reindex and format
    return (
        df.unstack(level=0)
        .reindex(years)
        .unstack()
        .reset_index((0, 1), name="members")
        .drop("level_0", axis=1)
    )

Output:

           country  members
1995-01-01      DE      NaN
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
2000-01-01      DE      NaN
1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1998-01-01      ES      NaN
1999-01-01      ES      NaN
2000-01-01      ES      NaN
1995-01-01      GB      NaN
1996-01-01      GB      NaN
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0

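Here is some explanation of each step:

  1. unstack(level=0): pivots the index (the level=0 part moves "country" into the columns and leaves "year" as the index, which makes the reindex that follows possible);
  2. reindex(years): reindexes to the target date range. Note that in your particular example this is not actually needed, because the data already contains every year at least once;
  3. unstack(): pivots again. Since the index is no longer a MultiIndex, pivoting here returns a Series with a hierarchical index: "members" > "country" > "year". At this stage we are basically done and only need to format the result as the desired DataFrame;
  4. reset_index((0, 1), name='members'): keeps only "year" as the index and names the column holding the Series values "members";
  5. drop('level_0', axis=1): drops the unwanted column.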
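
The function above takes its range from the earliest and latest year found in the data. If the start and end have to be supplied explicitly, as the title asks, one possible variation is to build the full country/year grid and reindex onto it. This is only a sketch under a few assumptions: the helper name upsample_to_range and its start/end parameters are made up for illustration, and the index levels are assumed to be named "country" and "year" as in the question.

```python
import pandas as pd


def upsample_to_range(df, start, end):
    """Hypothetical helper: put every country on the same year-start range."""
    # Year-start dates from start to end (inclusive when both fall on 1 January)
    years = pd.date_range(start, end, freq=pd.offsets.YearBegin(), name="year")
    countries = df.index.get_level_values("country").unique().sort_values()
    # Every (country, year) combination; pairs absent from df become NaN rows
    full_index = pd.MultiIndex.from_product(
        [countries, years], names=["country", "year"]
    )
    # Move "country" back to a column so the result matches the desired layout
    return df.reindex(full_index).reset_index("country")


df = pd.DataFrame({
    "year": [
        "1995-01-01", "1997-01-01",
        "1997-01-01", "1998-01-01", "2000-01-01",
        "1996-01-01", "1999-01-01",
    ],
    "country": ["ES", "ES", "GB", "GB", "GB", "DE", "DE"],
    "members": [100, 101, 200, 201, 202, 300, 301],
})
df["year"] = pd.to_datetime(df["year"])
df = df.set_index(["country", "year"])

result = upsample_to_range(df, "1995-01-01", "2000-01-01")
print(result)
```

Because the range is passed in rather than derived from the data, this also works when no country has a row for the first or last year.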