How to upsample a multi-index dataframe ensuring each grouping covers the same time range (provide custom starting and ending datetimes)
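Here is a dummy example to illustrate the problem.
I am interested in upsampling to the beginning of each year (AS), and for each country I want to cover the period from 1995 to 2000.
Suppose we have the following dataset: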
import pandas as pd

df = pd.DataFrame({
    'year': [
        '1995-01-01', '1997-01-01',
        '1997-01-01', '1998-01-01', '2000-01-01',
        '1996-01-01', '1999-01-01',
    ],
    'country': [
        'ES', 'ES',
        'GB', 'GB', 'GB',
        'DE', 'DE',
    ],
    'members': [
        100, 101,
        200, 201, 202,
        300, 301,
    ]
})
df['year'] = pd.to_datetime(df['year'])
df = df.set_index(['country', 'year'])
print(df)
                    members
country year
ES      1995-01-01      100
        1997-01-01      101
GB      1997-01-01      200
        1998-01-01      201
        2000-01-01      202
DE      1996-01-01      300
        1999-01-01      301
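As you can see, no country has data for every year between 1995 and 2000. Note also that some countries are missing 1995, while others are missing 2000.
I know how to upsample the dataframe so that the missing years in between are filled in for each country (for example, adding 1996 for Spain).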
def my_upsample(df):
    return (
        df
        .reset_index('country')  # upsampling a multi-index with the keyword level is not supported
        .groupby('country', group_keys=False)  # hence this little trick of single-indexing & grouping
        # see this issue for details: https://github.com/pandas-dev/pandas/issues/28313
        .resample('AS')  # resample to the beginning of each year ('AS' is deprecated in favour of 'YS' from pandas 2.2)
        .apply({
            'country': 'pad',  # pad the countries
            'members': 'asfreq',  # but leave the number of members as NaN. Irrelevant in this dummy example, but
                                  # the desired behaviour in my real-world problem
        })
    )

print(my_upsample(df))
           country  members
year
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0
But what I want is to make sure that every country covers the period from 1995 to 2000.
The desired output should look like this:
           country  members
year
1995-01-01      DE      NaN
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
2000-01-01      DE      NaN
1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1998-01-01      ES      NaN
1999-01-01      ES      NaN
2000-01-01      ES      NaN
1995-01-01      GB      NaN
1996-01-01      GB      NaN
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0
I can use a Python loop to go through each country and add the missing rows (see the code below), but I would like to know what the pandas way of doing this is.
import numpy as np

for country in df.index.levels[0]:
    if (country, '1995-01-01') not in df.query("country == @country").index:
        # if this country doesn't have the year 1995, create the row with NaN as value
        df.loc[(country, '1995-01-01'), :] = np.nan
    if (country, '2000-01-01') not in df.query("country == @country").index:
        # if this country doesn't have the year 2000, create the row with NaN as value
        df.loc[(country, '2000-01-01'), :] = np.nan

print(df.sort_index())
                    members
country year
DE      1995-01-01      NaN
        1996-01-01    300.0
        1999-01-01    301.0
        2000-01-01      NaN
ES      1995-01-01    100.0
        1997-01-01    101.0
        2000-01-01      NaN
GB      1995-01-01      NaN
        1997-01-01    200.0
        1998-01-01    201.0
        2000-01-01    202.0
Running my_upsample then returns the desired output:
print(my_upsample(df.sort_index()))
           country  members
year
1995-01-01      DE      NaN
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
2000-01-01      DE      NaN
1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1998-01-01      ES      NaN
1999-01-01      ES      NaN
2000-01-01      ES      NaN
1995-01-01      GB      NaN
1996-01-01      GB      NaN
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0
I'm sure there is a better way, but here is one way to do it:
def my_upsample(df):
    # Get all periods, from the earliest to the latest year present in the data
    years = df.index.get_level_values(1)
    years = pd.date_range(years.min(), years.max(), freq="AS")
    # Reindex and format
    return (
        df.unstack(level=0)
        .reindex(years)
        .unstack()
        .reset_index((0, 1), name="members")
        .drop("level_0", axis=1)
    )
Output:
           country  members
1995-01-01      DE      NaN
1996-01-01      DE    300.0
1997-01-01      DE      NaN
1998-01-01      DE      NaN
1999-01-01      DE    301.0
2000-01-01      DE      NaN
1995-01-01      ES    100.0
1996-01-01      ES      NaN
1997-01-01      ES    101.0
1998-01-01      ES      NaN
1999-01-01      ES      NaN
2000-01-01      ES      NaN
1995-01-01      GB      NaN
1996-01-01      GB      NaN
1997-01-01      GB    200.0
1998-01-01      GB    201.0
1999-01-01      GB      NaN
2000-01-01      GB    202.0
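The function above takes its range from the earliest and latest year found in the data. If the start and end need to be supplied explicitly, one possible sketch is to build the full (country, year) grid and reindex onto it. The helper name upsample_to_range and its start and end parameters are made up for this illustration, and pd.offsets.YearBegin() is used instead of a frequency string so that the example does not depend on the 'AS'/'YS' alias.

```python
import pandas as pd

df = pd.DataFrame({
    "year": pd.to_datetime([
        "1995-01-01", "1997-01-01",
        "1997-01-01", "1998-01-01", "2000-01-01",
        "1996-01-01", "1999-01-01",
    ]),
    "country": ["ES", "ES", "GB", "GB", "GB", "DE", "DE"],
    "members": [100, 101, 200, 201, 202, 300, 301],
}).set_index(["country", "year"])


def upsample_to_range(df, start, end):
    """Hypothetical helper: put every country on the same yearly grid."""
    # Year starts between the custom bounds (both inclusive).
    years = pd.date_range(start, end, freq=pd.offsets.YearBegin())
    # Countries actually present in the data, sorted alphabetically.
    countries = df.index.get_level_values("country").unique().sort_values()
    # Cartesian product: one entry per (country, year) pair.
    full_index = pd.MultiIndex.from_product(
        [countries, years], names=["country", "year"]
    )
    # Pairs that are absent from df become rows filled with NaN.
    return df.reindex(full_index).reset_index("country")


result = upsample_to_range(df, "1995-01-01", "2000-01-01")
print(result)
```

Because the grid is built independently of the data, this also covers the case where no country at all has a row for the first or last year of the requested range.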
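Here is some explanation of each step:
unstack(level=0): pivots the index (the level=0 part moves "country" into the columns and leaves "year" as the index, which makes the following reindex possible);
reindex(years): reindexes to the target date range. Note that in your specific example this is not actually necessary, because your example already contains every year at least once;
unstack(): pivots again. Since the index is no longer a MultiIndex, pivoting here returns a Series with a hierarchical index: "members" > "country" > "year". At this stage we are basically done and only need to format the result into the desired DataFrame;
reset_index((0, 1), name='members'): keeps only "year" as the index and names the column holding the original Series values "members";
drop('level_0', axis=1): drops the unneeded column.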
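To make the steps above easier to follow, here is a small sketch that runs the chain one call at a time and keeps the intermediate objects. The variable names wide, stacked and tidy are only for illustration, and the date range is written out with pd.offsets.YearBegin() rather than a frequency string so that it does not depend on the 'AS'/'YS' alias.

```python
import pandas as pd

df = pd.DataFrame({
    "year": pd.to_datetime([
        "1995-01-01", "1997-01-01",
        "1997-01-01", "1998-01-01", "2000-01-01",
        "1996-01-01", "1999-01-01",
    ]),
    "country": ["ES", "ES", "GB", "GB", "GB", "DE", "DE"],
    "members": [100, 101, 200, 201, 202, 300, 301],
}).set_index(["country", "year"])

years = pd.date_range("1995-01-01", "2000-01-01", freq=pd.offsets.YearBegin())

# Step 1: move "country" into the columns; "year" is now the only index level.
wide = df.unstack(level=0)
# Step 2: align the rows with the target years (a no-op for this data set).
wide = wide.reindex(years)
print(wide.shape)  # (6, 3): six years by three countries

# Step 3: pivot back into a Series with three index levels.
stacked = wide.unstack()
print(stacked.index.nlevels)  # 3

# Steps 4 and 5: move the first two levels into columns, then drop the
# column that only repeats the label "members".
tidy = stacked.reset_index((0, 1), name="members").drop("level_0", axis=1)
print(tidy)
```

Printing wide after step 1 is a quick way to see why reindex fills every country at once: each country is a column, so adding a row for a year adds it for all of them.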