Change a yearly ordered dataframe to a seasonally ordered dataframe
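In pandas, I want to create columns that represent seasons running from November through October of the following year (for example, a tourism season).
Here is my code snippet: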
import numpy as np  # needed for np.random.seed below
from numpy import random
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'date': pd.date_range('1990-01-01', freq='M', periods=12),
'travel_2016': random.randint(10, size=(12)),
'travel_2017': random.randint(10, size=(12)),
'travel_2018': random.randint(10, size=(12)),
'travel_2019': random.randint(10, size=(12)),
'travel_2020': random.randint(10, size=(12))})
df['month_date'] = df['date'].dt.strftime('%m')
df = df.drop(columns = ['date'])
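I was trying this approach,
but I failed with both the 'unpivoting' and the pivot table solutions. It would be easier for me to keep the pivot table for future operations.
My desired output looks like this: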
season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 8 7 7 4 11
1 0 1 4 8 12
2 1 4 5 9 01
3 8 3 5 7 02
4 4 7 8 3 03
5 6 8 4 4 04
6 5 8 3 1 05
7 7 0 1 1 06
8 1 2 1 3 07
9 8 9 7 5 08
10 7 7 7 8 09
11 9 1 4 0 10
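Thanks a lot!
Your table is already roughly in the shape you want: you basically shift all rows down by 2 and move the bottom 2 rows to the top, but into the next year's column.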
>>> year_ends = df.shift(-10)
>>> year_ends = year_ends.drop(columns=['month_date']).shift(axis='columns').join(year_ends['month_date'])
>>> year_ends
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN
The rest is easy:
>>> seasons = df.shift(2).fillna(year_ends)
>>> seasons
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Of course, you should now rename the columns appropriately:
>>> seasons.rename(columns=lambda c: c if not c.startswith('travel_') else f"season_{int(c[7:]) - 1}/{c[7:]}")
season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
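Note that the first two values of the 2015/2016 season are NaN, which makes sense because November and December 2015 are not in the initial dataframe.
Another approach is to use the datetime tools. This may be more general: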
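The shift-based steps above can be wrapped into one function. This is only a sketch: the helper name `to_seasons` and its parameters (`month_col`, `prefix`, `n_tail`) are made up for illustration, and it assumes the rows are ordered January to December and that every data column is named `travel_<year>`. The small `demo` frame is made-up data, not the frame from the question.

```python
import pandas as pd


def to_seasons(df, month_col="month_date", prefix="travel_", n_tail=2):
    """Hypothetical helper: turn calendar-year columns into Nov-Oct season columns."""
    # Pull the last n_tail rows (Nov, Dec) up to the top of the frame ...
    year_ends = df.shift(-(len(df) - n_tail))
    # ... and move their values one column to the right (into the next year).
    year_ends = (
        year_ends.drop(columns=[month_col])
        .shift(axis="columns")
        .join(year_ends[month_col])
    )
    # Push Jan-Oct down by n_tail rows and fill the gap with the Nov/Dec rows.
    seasons = df.shift(n_tail).fillna(year_ends)
    # travel_2017 -> season_2016/2017; other columns keep their names.
    return seasons.rename(
        columns=lambda c: f"season_{int(c[len(prefix):]) - 1}/{c[len(prefix):]}"
        if c.startswith(prefix)
        else c
    )


# Made-up demo data: two years, rows ordered January..December.
demo = pd.DataFrame({
    "travel_2016": range(12),
    "travel_2017": range(100, 112),
    "month_date": [f"{m:02d}" for m in range(1, 13)],
})
out = to_seasons(demo)
print(out)
```

As with the step-by-step version, the first season column starts with two NaN values, and the November and December values of the last year are dropped because there is no following column to hold them.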
>>> data = df.set_index('month_date').rename_axis('year', axis='columns').stack().reset_index(name='data')
>>> data.head()
month_date year data
0 01 travel_2016 5
1 01 travel_2017 8
2 01 travel_2018 4
3 01 travel_2019 3
4 01 travel_2020 2
>>> dates = data['year'].str[7:].str.cat(data['month_date']).transform(pd.to_datetime, format='%Y%m')
>>> dates.head()
0 2016-01-01
1 2017-01-01
2 2018-01-01
3 2019-01-01
4 2020-01-01
Name: year, dtype: datetime64[ns]
Then follow the linked question to get the fiscal year starting in November:
>>> season = dates.dt.to_period('Q-OCT').dt.qyear.rename('season')
>>> seasonal_data = data.join(season).pivot(index='month_date', columns='season', values='data')
>>> seasonal_data.rename(columns=lambda c: f"season_{c - 1}/{c}", inplace=True)
>>> seasonal_data.reindex([*df['month_date'][-2:], *df['month_date'][:-2]]).reset_index()
season month_date season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 season_2020/2021
0 11 NaN 7.0 8.0 3.0 2.0 4.0
1 12 NaN 6.0 9.0 3.0 7.0 9.0
2 01 5.0 8.0 4.0 3.0 2.0 NaN
3 02 0.0 8.0 3.0 7.0 0.0 NaN
4 03 3.0 1.0 0.0 0.0 0.0 NaN
5 04 3.0 6.0 3.0 1.0 4.0 NaN
6 05 7.0 7.0 5.0 9.0 5.0 NaN
7 06 9.0 7.0 0.0 9.0 5.0 NaN
8 07 3.0 8.0 2.0 0.0 6.0 NaN
9 08 5.0 1.0 3.0 4.0 8.0 NaN
10 09 2.0 5.0 8.0 7.0 4.0 NaN
11 10 4.0 9.0 1.0 3.0 1.0 NaN
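To see why `'Q-OCT'` gives a season that starts in November, it can help to check the mapping on a few dates. This is a minimal sketch with made-up dates: `qyear` reports the fiscal year in which the period ends, so October still belongs to the old season while November and December already belong to the next one.

```python
import pandas as pd

# Made-up sample dates around the season boundary.
dates = pd.Series(pd.to_datetime(
    ["2016-10-01", "2016-11-01", "2016-12-01", "2017-01-01"]
))

# Fiscal year ending in October: Nov 2016 already counts as fiscal year 2017.
season = dates.dt.to_period("Q-OCT").dt.qyear
print(season.tolist())  # [2016, 2017, 2017, 2017]
```

This is why the rename step uses `c - 1` for the first half of the label: a `qyear` of 2017 corresponds to the 2016/2017 season.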