Change a yearly ordered dataframe to a seasonally ordered dataframe

In pandas, I would like to create columns that represent seasons (e.g. travel seasons) running from November through October of the following year.

Here is my code snippet:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    # 'MS' (month start) yields one timestamp per month; only the month number is used below
    'date': pd.date_range('1990-01-01', freq='MS', periods=12),
    'travel_2016': np.random.randint(10, size=12),
    'travel_2017': np.random.randint(10, size=12),
    'travel_2018': np.random.randint(10, size=12),
    'travel_2019': np.random.randint(10, size=12),
    'travel_2020': np.random.randint(10, size=12)})

# Keep only the zero-padded month number ('01'..'12') and drop the full date
df['month_date'] = df['date'].dt.strftime('%m')
df = df.drop(columns=['date'])

I tried this approach, but both the 'unpivoting' and the pivot table solutions failed for me. Keeping the pivot table would make future operations easier for me.

My desired output looks like this:

    season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0   8                7                7                4                11
1   0                1                4                8                12
2   1                4                5                9                01
3   8                3                5                7                02
4   4                7                8                3                03
5   6                8                4                4                04
6   5                8                3                1                05
7   7                0                1                1                06
8   1                2                1                3                07
9   8                9                7                5                08
10  7                7                7                8                09
11  9                1                4                0                10

Thanks a lot!

Your table is already roughly in the shape you want: you essentially shift all rows down by 2 and move the bottom 2 rows to the top, but into the next year's column.

>>> year_ends = df.shift(-10)
>>> year_ends = year_ends.drop(columns=['month_date']).shift(axis='columns').join(year_ends['month_date'])
>>> year_ends
    travel_2016  travel_2017  travel_2018  travel_2019  travel_2020 month_date
0           NaN          7.0          8.0          3.0          2.0         11
1           NaN          6.0          9.0          3.0          7.0         12
2           NaN          NaN          NaN          NaN          NaN        NaN
3           NaN          NaN          NaN          NaN          NaN        NaN
4           NaN          NaN          NaN          NaN          NaN        NaN
5           NaN          NaN          NaN          NaN          NaN        NaN
6           NaN          NaN          NaN          NaN          NaN        NaN
7           NaN          NaN          NaN          NaN          NaN        NaN
8           NaN          NaN          NaN          NaN          NaN        NaN
9           NaN          NaN          NaN          NaN          NaN        NaN
10          NaN          NaN          NaN          NaN          NaN        NaN
11          NaN          NaN          NaN          NaN          NaN        NaN

The rest is easy:

>>> seasons = df.shift(2).fillna(year_ends)
>>> seasons
    travel_2016  travel_2017  travel_2018  travel_2019  travel_2020 month_date
0           NaN          7.0          8.0          3.0          2.0         11
1           NaN          6.0          9.0          3.0          7.0         12
2           5.0          8.0          4.0          3.0          2.0         01
3           0.0          8.0          3.0          7.0          0.0         02
4           3.0          1.0          0.0          0.0          0.0         03
5           3.0          6.0          3.0          1.0          4.0         04
6           7.0          7.0          5.0          9.0          5.0         05
7           9.0          7.0          0.0          9.0          5.0         06
8           3.0          8.0          2.0          0.0          6.0         07
9           5.0          1.0          3.0          4.0          8.0         08
10          2.0          5.0          8.0          7.0          4.0         09
11          4.0          9.0          1.0          3.0          1.0         10

Of course, you should now rename the columns appropriately:

>>> seasons.rename(columns=lambda c: c if not c.startswith('travel_') else f"season_{int(c[7:]) - 1}/{c[7:]}")
    season_2015/2016  season_2016/2017  season_2017/2018  season_2018/2019  season_2019/2020 month_date
0                NaN               7.0               8.0               3.0               2.0         11
1                NaN               6.0               9.0               3.0               7.0         12
2                5.0               8.0               4.0               3.0               2.0         01
3                0.0               8.0               3.0               7.0               0.0         02
4                3.0               1.0               0.0               0.0               0.0         03
5                3.0               6.0               3.0               1.0               4.0         04
6                7.0               7.0               5.0               9.0               5.0         05
7                9.0               7.0               0.0               9.0               5.0         06
8                3.0               8.0               2.0               0.0               6.0         07
9                5.0               1.0               3.0               4.0               8.0         08
10               2.0               5.0               8.0               7.0               4.0         09
11               4.0               9.0               1.0               3.0               1.0         10

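Note that the first two values of season 2015/2016 (November and December 2015) are NaN, which makes sense because they are not in the initial dataframe.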
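
The shift-based steps above can be wrapped into a small helper. This is only a sketch: `to_seasons` and its parameters `start_month`, `month_col` and `prefix` are hypothetical names, not part of pandas. It assumes one row per month sorted from January to December and consecutive year columns named like `travel_2016`. It slices off the late months instead of using `fillna`, and it returns `month_date` as the first column rather than the last.

```python
import pandas as pd


def to_seasons(df, start_month=11, month_col="month_date", prefix="travel_"):
    """Hypothetical helper: regroup calendar-year columns into seasons.

    Assumes 12 rows sorted January..December, start_month between 2 and 12,
    and consecutive year columns named '<prefix><year>'.
    """
    n_tail = 12 - start_month + 1  # months moved to the next season (2 for November)
    # Cast to float so that shifting can introduce NaN.
    values = df.set_index(month_col).astype(float)
    # Late months go to the top and one column to the right (next year's season).
    tail = values.iloc[-n_tail:].shift(axis="columns")
    # Early months keep their column and follow the late months.
    head = values.iloc[:-n_tail]
    seasons = pd.concat([tail, head])
    seasons.columns = [
        f"season_{int(c[len(prefix):]) - 1}/{c[len(prefix):]}" for c in seasons.columns
    ]
    return seasons.rename_axis(month_col).reset_index()


# Made-up demo data: each value encodes year * 100 + month, so moves are easy to trace.
demo = pd.DataFrame({
    "travel_2016": [2016 * 100 + m for m in range(1, 13)],
    "travel_2017": [2017 * 100 + m for m in range(1, 13)],
    "month_date": [f"{m:02d}" for m in range(1, 13)],
})
result = to_seasons(demo)
print(result)
```

As in the version above, the November and December values of the last year column are dropped, because there is no later column to receive them.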


Another approach is to use the datetime tools, which may be more general:

>>> data = df.set_index('month_date').rename_axis('year', axis='columns').stack().reset_index(name='data')
>>> data.head()
  month_date         year  data
0         01  travel_2016     5
1         01  travel_2017     8
2         01  travel_2018     4
3         01  travel_2019     3
4         01  travel_2020     2
>>> dates = data['year'].str[7:].str.cat(data['month_date']).transform(pd.to_datetime, format='%Y%m')
>>> dates.head()
0   2016-01-01
1   2017-01-01
2   2018-01-01
3   2019-01-01
4   2020-01-01
Name: year, dtype: datetime64[ns]
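
As an aside, the same long format can probably also be reached with `melt` instead of `set_index(...).stack()`. A minimal sketch on a three-month toy frame; the names `wide` and `long_df` and the values are only illustrative:

```python
import pandas as pd

# Toy wide table: one column per year, one row per month (illustrative values).
wide = pd.DataFrame({
    "travel_2016": [5, 0, 3],
    "travel_2017": [8, 8, 1],
    "month_date": ["01", "02", "03"],
})

# One row per (month, year column) pair.
long_df = wide.melt(id_vars="month_date", var_name="year", value_name="data")

# Concatenate the 4-digit year and the 2-digit month, then parse as a date.
long_df["date"] = pd.to_datetime(
    long_df["year"].str[len("travel_"):] + long_df["month_date"], format="%Y%m"
)
print(long_df)
```

The rows come out grouped by year column rather than by month, which does not matter for the pivot that follows.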

Then, following the linked question, get the fiscal year starting in November:

>>> season = dates.dt.to_period('Q-OCT').dt.qyear.rename('season')
>>> seasonal_data = data.join(season).pivot(index='month_date', columns='season', values='data')
>>> seasonal_data.rename(columns=lambda c: f"season_{c - 1}/{c}", inplace=True)
>>> seasonal_data.reindex([*df['month_date'][-2:], *df['month_date'][:-2]]).reset_index()
season month_date  season_2015/2016  season_2016/2017  season_2017/2018  season_2018/2019  season_2019/2020  season_2020/2021
0              11               NaN               7.0               8.0               3.0               2.0               4.0
1              12               NaN               6.0               9.0               3.0               7.0               9.0
2              01               5.0               8.0               4.0               3.0               2.0               NaN
3              02               0.0               8.0               3.0               7.0               0.0               NaN
4              03               3.0               1.0               0.0               0.0               0.0               NaN
5              04               3.0               6.0               3.0               1.0               4.0               NaN
6              05               7.0               7.0               5.0               9.0               5.0               NaN
7              06               9.0               7.0               0.0               9.0               5.0               NaN
8              07               3.0               8.0               2.0               0.0               6.0               NaN
9              08               5.0               1.0               3.0               4.0               8.0               NaN
10             09               2.0               5.0               8.0               7.0               4.0               NaN
11             10               4.0               9.0               1.0               3.0               1.0               NaN
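
To see why `'Q-OCT'` produces the right grouping, it may help to look at a few dates around the season boundary. The sketch below uses made-up dates; `'Q-OCT'` declares quarters of a fiscal year that ends in October, and `qyear` labels each date with the year in which that fiscal year ends:

```python
import pandas as pd

# Illustrative dates on both sides of the October/November boundary.
dates = pd.Series(pd.to_datetime([
    "2016-10-01", "2016-11-01", "2016-12-01",
    "2017-01-01", "2017-10-01", "2017-11-01",
]))

# Fiscal year ending in October; qyear is the year that fiscal year ends in.
season_end = dates.dt.to_period("Q-OCT").dt.qyear

# Same label format as the renamed columns above.
labels = season_end.map(lambda y: f"season_{y - 1}/{y}")
print(pd.DataFrame({"date": dates, "season": labels}))
```

October 2016 still belongs to season_2015/2016, while November 2016 through October 2017 all fall into season_2016/2017.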