Advanced dataframe reshaping in pandas
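I am trying to reshape a dataframe month by month, without success. My dataframe contains data covering a given delivery period: monthly, quarterly, or yearly. Basically, I would like to reshape it as follows: use the monthly values for as long as monthly data is available, then switch to the quarterly values once the monthly data runs out, and finally to the yearly values once the quarterly data runs out. Do you have any idea how I could do this?
Many thanks for your help!
Input: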
var_name begin_delivery_date end_delivery_date value
Monthly 2022 2022-01-01T06:00:00 2022-02-01T05:59:59 5
Monthly 2022 2022-02-01T06:00:00 2022-03-01T05:59:59 7
... ... ... ...
Quarterly 2022 2022-01-01T06:00:00 2022-04-01T06:00:00 10
... ... ... ...
Yearly 2022 2022-01-01T06:00:00 2023-01-01T06:00:00 49
Expected output:
date var_name value
2022-01-01 Monthly 2022 5
2022-02-01 Monthly 2022 7
2022-03-01 Quarterly 2022 10
2022-04-01 Yearly 2022 49
2022-05-01 Yearly 2022 49
2022-06-01 Yearly 2022 49
2022-07-01 Yearly 2022 49
2022-08-01 Yearly 2022 49
2022-09-01 Yearly 2022 49
2022-10-01 Yearly 2022 49
2022-11-01 Yearly 2022 49
2022-12-01 Yearly 2022 49
Input data to play with:
[ {
"begin_delivery_date": "2022-01-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-02-01T05:59:59",
"value": 5
},
{
"begin_delivery_date": "2022-02-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-03-01T05:59:59",
"value": 7
},
{
"begin_delivery_date": "2022-03-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-04-01T05:59:59",
"value": 8
},
{
"begin_delivery_date": "2022-04-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-05-01T05:59:59",
"value": 9
},
{
"begin_delivery_date": "2022-04-01T06:00:00",
"var name": "Quarterly 2022",
"end_delivery_date": "2022-07-01T05:59:59",
"value": 10
},
{
"begin_delivery_date": "2022-07-01T06:00:00",
"var name": "Quarterly 2022",
"end_delivery_date": "2022-10-01T05:59:59",
"value": 11
},
{
"begin_delivery_date": "2022-09-01T06:00:00",
"var name": "Quarterly 2022",
"end_delivery_date": "2023-01-01T05:59:59",
"value": 12
},
{
"begin_delivery_date": "2023-01-01T06:00:00",
"var name": "Yearly 2023",
"end_delivery_date": "2024-01-01T05:59:59",
"value": 50
},
{
"begin_delivery_date": "2024-01-01T06:00:00",
"var name": "Yearly 2024",
"end_delivery_date": "2025-01-01T05:59:59",
"value": 60
}
]
IIUC,
import pandas as pd
import numpy as np
data = [ {
"begin_delivery_date": "2022-01-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-02-01T05:59:59",
"value": 5
},
{
"begin_delivery_date": "2022-02-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-03-01T05:59:59",
"value": 7
},
{
"begin_delivery_date": "2022-03-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-04-01T05:59:59",
"value": 8
},
{
"begin_delivery_date": "2022-04-01T06:00:00",
"var name": "Monthly 2022",
"end_delivery_date": "2022-05-01T05:59:59",
"value": 9
},
{
"begin_delivery_date": "2022-04-01T06:00:00",
"var name": "Quarterly 2022",
"end_delivery_date": "2022-07-01T05:59:59",
"value": 10
},
{
"begin_delivery_date": "2022-07-01T06:00:00",
"var name": "Quarterly 2022",
"end_delivery_date": "2022-10-01T05:59:59",
"value": 11
},
{
"begin_delivery_date": "2022-09-01T06:00:00",
"var name": "Quarterly 2022",
"end_delivery_date": "2023-01-01T05:59:59",
"value": 12
},
{
"begin_delivery_date": "2023-01-01T06:00:00",
"var name": "Yearly 2023",
"end_delivery_date": "2024-01-01T05:59:59",
"value": 50
},
{
"begin_delivery_date": "2024-01-01T06:00:00",
"var name": "Yearly 2024",
"end_delivery_date": "2025-01-01T05:59:59",
"value": 60
}
]
df = pd.DataFrame(data)
Create a list of dates from each row's date range, then explode the dataframe so that each date gets its own row.
df['dates'] = [pd.date_range(s, e, freq='M') for s, e in zip(df['begin_delivery_date'], df['end_delivery_date'])]
df_out = df.explode('dates')
print(df_out)
Output:
begin_delivery_date var name end_delivery_date value dates
0 2022-01-01T06:00:00 Monthly 2022 2022-02-01T05:59:59 5 2022-01-31 06:00:00
1 2022-02-01T06:00:00 Monthly 2022 2022-03-01T05:59:59 7 2022-02-28 06:00:00
2 2022-03-01T06:00:00 Monthly 2022 2022-04-01T05:59:59 8 2022-03-31 06:00:00
3 2022-04-01T06:00:00 Monthly 2022 2022-05-01T05:59:59 9 2022-04-30 06:00:00
4 2022-04-01T06:00:00 Quarterly 2022 2022-07-01T05:59:59 10 2022-04-30 06:00:00
4 2022-04-01T06:00:00 Quarterly 2022 2022-07-01T05:59:59 10 2022-05-31 06:00:00
4 2022-04-01T06:00:00 Quarterly 2022 2022-07-01T05:59:59 10 2022-06-30 06:00:00
5 2022-07-01T06:00:00 Quarterly 2022 2022-10-01T05:59:59 11 2022-07-31 06:00:00
5 2022-07-01T06:00:00 Quarterly 2022 2022-10-01T05:59:59 11 2022-08-31 06:00:00
5 2022-07-01T06:00:00 Quarterly 2022 2022-10-01T05:59:59 11 2022-09-30 06:00:00
6 2022-09-01T06:00:00 Quarterly 2022 2023-01-01T05:59:59 12 2022-09-30 06:00:00
6 2022-09-01T06:00:00 Quarterly 2022 2023-01-01T05:59:59 12 2022-10-31 06:00:00
6 2022-09-01T06:00:00 Quarterly 2022 2023-01-01T05:59:59 12 2022-11-30 06:00:00
6 2022-09-01T06:00:00 Quarterly 2022 2023-01-01T05:59:59 12 2022-12-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-01-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-02-28 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-03-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-04-30 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-05-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-06-30 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-07-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-08-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-09-30 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-10-31 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-11-30 06:00:00
7 2023-01-01T06:00:00 Yearly 2023 2024-01-01T05:59:59 50 2023-12-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-01-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-02-29 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-03-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-04-30 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-05-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-06-30 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-07-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-08-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-09-30 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-10-31 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-11-30 06:00:00
8 2024-01-01T06:00:00 Yearly 2024 2025-01-01T05:59:59 60 2024-12-31 06:00:00
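The exploded frame still holds several rows for months covered by more than one contract (April 2022 appears as both Monthly and Quarterly). One possible way to finish the job, sketched here under assumptions that are not part of the answer above: dates are snapped to the first day of each month with freq="MS", every contract is assumed to end early on the first day of the following month, and the hypothetical `priority` mapping encodes "monthly before quarterly before yearly". Only a shortened copy of the question's data is used.

```python
import pandas as pd

# Shortened copy of the question's data (values taken from the post).
data = [
    {"begin_delivery_date": "2022-03-01T06:00:00", "var name": "Monthly 2022",
     "end_delivery_date": "2022-04-01T05:59:59", "value": 8},
    {"begin_delivery_date": "2022-04-01T06:00:00", "var name": "Monthly 2022",
     "end_delivery_date": "2022-05-01T05:59:59", "value": 9},
    {"begin_delivery_date": "2022-04-01T06:00:00", "var name": "Quarterly 2022",
     "end_delivery_date": "2022-07-01T05:59:59", "value": 10},
    {"begin_delivery_date": "2022-07-01T06:00:00", "var name": "Quarterly 2022",
     "end_delivery_date": "2022-10-01T05:59:59", "value": 11},
]
df = pd.DataFrame(data)

# Drop the time of day so that month boundaries line up.
begin = pd.to_datetime(df["begin_delivery_date"]).dt.normalize()
end = pd.to_datetime(df["end_delivery_date"]).dt.normalize()

# One month-start timestamp per delivery month covered by each row.
# Assumption: the end date falls on the first day of the following month,
# so stepping back one day lands inside the last delivery month.
df["date"] = [
    pd.date_range(b, e - pd.Timedelta(days=1), freq="MS")
    for b, e in zip(begin, end)
]
out = df.explode("date")
out["date"] = pd.to_datetime(out["date"])

# Hypothetical priority mapping: a lower number is preferred.
priority = {"Monthly": 0, "Quarterly": 1, "Yearly": 2}
out["rank"] = out["var name"].str.split().str[0].map(priority)

# For each month keep only the row with the finest granularity.
result = (
    out.sort_values(["date", "rank"])
       .drop_duplicates("date", keep="first")
       .loc[:, ["date", "var name", "value"]]
       .reset_index(drop=True)
)
print(result)
```

With this sample, March and April take the monthly values and May through September fall back to the quarterly ones. If two contracts of the same granularity overlap, as the September rows in the full data do, this sketch does not define which one is kept, so an extra tie-break column would be needed.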
Create the df and shuffle it (the data is the same data you posted above)
df = pd.DataFrame(data)
df = df.sample(frac=1).reset_index(drop=True)
Split each value in the "var name" column into 2 separate columns, var_name_pediod and var_name_year
df["var_name_pediod"] = df["var name"].str.split(" ").str[0]
df["var_name_year"] = df["var name"].str.split(" ").str[1]
Create a dictionary that defines the order of the periods, and replace the values in the "var_name_pediod" column using that dictionary
sort_dic = {"Monthly":1,"Quarterly":2,"Yearly":3}
df["var_name_pediod"] = df["var_name_pediod"].replace(sort_dic)
Sort the values by the "var_name_pediod" column
df.sort_values(by=['var_name_pediod'], inplace=True)
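Within each var_name_pediod, sort by "var_name_year". Sorting on both columns at once has the same effect as grouping and then sorting each group, and the result is assigned back to df because sort_values returns a new frame
df = df.sort_values(by=['var_name_pediod', 'var_name_year']).reset_index(drop=True)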
Done. Drop the extra columns if you don't need them
df.drop(columns=["var_name_pediod","var_name_year"],inplace=True)
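The same ordering can probably be obtained without a replacement dictionary by treating the period as an ordered categorical. The following is only a sketch on a small made-up frame: the temporary column names `_period` and `_year` are hypothetical, and the year is converted to an integer on the assumption that every "var name" value has the form "<Period> <year>".

```python
import pandas as pd

# Small illustrative frame in deliberately shuffled order.
df = pd.DataFrame({
    "var name": ["Yearly 2023", "Quarterly 2022", "Monthly 2022",
                 "Yearly 2024", "Monthly 2022"],
    "value": [50, 10, 5, 60, 7],
})

# Split "Monthly 2022" into its period and year parts.
parts = df["var name"].str.split(" ", expand=True)

# An ordered categorical sorts by the category order given here,
# not alphabetically.
period = pd.Categorical(
    parts[0], categories=["Monthly", "Quarterly", "Yearly"], ordered=True
)
year = parts[1].astype(int)

sorted_df = (
    df.assign(_period=period, _year=year)       # temporary sort keys
      .sort_values(["_period", "_year"])
      .drop(columns=["_period", "_year"])       # remove the helpers again
      .reset_index(drop=True)
)
print(sorted_df)
```

The categorical keeps the intended order in one place (the `categories` list), so adding another granularity would only mean extending that list.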