Advanced dataframe reshaping in pandas

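I'm trying to reshape a dataframe by month, without success. I have a dataframe whose rows each cover a given period: monthly, quarterly, or yearly. Basically, I want to reshape it as follows: use the monthly values for as long as monthly data is available, switch to the quarterly values once the monthly data runs out, and switch to the yearly values once the quarterly data runs out. Any idea how I can do this?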

Thanks a lot for your help!

Input:

var_name        begin_delivery_date    end_delivery_date      value
Monthly 2022    2022-01-01T06:00:00    2022-02-01T05:59:59      5
Monthly 2022    2022-02-01T06:00:00    2022-03-01T05:59:59      7
   ...                 ...                  ...                ...
Quarterly 2022  2022-01-01T06:00:00    2022-04-01T06:00:00      10
   ...                 ...                  ...                ...
Yearly 2022     2022-01-01T06:00:00    2023-01-01T06:00:00     49

Expected output:

  date        var_name        value
  2022-01-01   Monthly 2022     5
  2022-02-01   Monthly 2022     7
  2022-03-01  Quarterly 2022    10
  2022-04-01  Yearly 2022       49
  2022-05-01  Yearly 2022       49
  2022-06-01  Yearly 2022       49 
  2022-07-01  Yearly 2022       49 
  2022-08-01  Yearly 2022       49
  2022-09-01  Yearly 2022       49 
  2022-10-01  Yearly 2022       49
  2022-11-01  Yearly 2022       49
  2022-12-01  Yearly 2022       49

Input data to play with:

 [  {
        "begin_delivery_date": "2022-01-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-02-01T05:59:59",
        "value": 5
    },
    {
        "begin_delivery_date": "2022-02-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-03-01T05:59:59",
        "value": 7
    },
    {
        "begin_delivery_date": "2022-03-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-04-01T05:59:59",
        "value": 8
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-05-01T05:59:59",
        "value": 9
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-07-01T05:59:59",
        "value": 10
    },
    {
        "begin_delivery_date": "2022-07-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-10-01T05:59:59",
        "value": 11
    },
    {
        "begin_delivery_date": "2022-09-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2023-01-01T05:59:59",
        "value": 12
    },
    {
        "begin_delivery_date": "2023-01-01T06:00:00",
        "var name": "Yearly 2023",
        "end_delivery_date": "2024-01-01T05:59:59",
        "value": 50
    },
    {
        "begin_delivery_date": "2024-01-01T06:00:00",
        "var name": "Yearly 2024",
        "end_delivery_date": "2025-01-01T05:59:59",
        "value": 60
    }
 ]

IIUC,

import pandas as pd
import numpy as np

data =  [ {
        "begin_delivery_date": "2022-01-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-02-01T05:59:59",
        "value": 5
    },
    {
        "begin_delivery_date": "2022-02-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-03-01T05:59:59",
        "value": 7
    },
    {
        "begin_delivery_date": "2022-03-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-04-01T05:59:59",
        "value": 8
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Monthly 2022",
        "end_delivery_date": "2022-05-01T05:59:59",
        "value": 9
    },
    {
        "begin_delivery_date": "2022-04-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-07-01T05:59:59",
        "value": 10
    },
    {
        "begin_delivery_date": "2022-07-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2022-10-01T05:59:59",
        "value": 11
    },
    {
        "begin_delivery_date": "2022-09-01T06:00:00",
        "var name": "Quarterly 2022",
        "end_delivery_date": "2023-01-01T05:59:59",
        "value": 12
    },
    {
        "begin_delivery_date": "2023-01-01T06:00:00",
        "var name": "Yearly 2023",
        "end_delivery_date": "2024-01-01T05:59:59",
        "value": 50
    },
    {
        "begin_delivery_date": "2024-01-01T06:00:00",
        "var name": "Yearly 2024",
        "end_delivery_date": "2025-01-01T05:59:59",
        "value": 60
    }
]

df = pd.DataFrame(data)

Create a list of dates from each row's date range, then explode the dataframe so that every date gets its own row.

# freq='M' yields month-end stamps; pandas 2.2+ deprecates this alias in favor of 'ME'.
df['dates'] = [pd.date_range(s, e, freq='M') for s, e in zip(df['begin_delivery_date'], df['end_delivery_date'])]

df_out = df.explode('dates')
print(df_out)

Output:

   begin_delivery_date        var name    end_delivery_date  value               dates
0  2022-01-01T06:00:00    Monthly 2022  2022-02-01T05:59:59      5 2022-01-31 06:00:00
1  2022-02-01T06:00:00    Monthly 2022  2022-03-01T05:59:59      7 2022-02-28 06:00:00
2  2022-03-01T06:00:00    Monthly 2022  2022-04-01T05:59:59      8 2022-03-31 06:00:00
3  2022-04-01T06:00:00    Monthly 2022  2022-05-01T05:59:59      9 2022-04-30 06:00:00
4  2022-04-01T06:00:00  Quarterly 2022  2022-07-01T05:59:59     10 2022-04-30 06:00:00
4  2022-04-01T06:00:00  Quarterly 2022  2022-07-01T05:59:59     10 2022-05-31 06:00:00
4  2022-04-01T06:00:00  Quarterly 2022  2022-07-01T05:59:59     10 2022-06-30 06:00:00
5  2022-07-01T06:00:00  Quarterly 2022  2022-10-01T05:59:59     11 2022-07-31 06:00:00
5  2022-07-01T06:00:00  Quarterly 2022  2022-10-01T05:59:59     11 2022-08-31 06:00:00
5  2022-07-01T06:00:00  Quarterly 2022  2022-10-01T05:59:59     11 2022-09-30 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-09-30 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-10-31 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-11-30 06:00:00
6  2022-09-01T06:00:00  Quarterly 2022  2023-01-01T05:59:59     12 2022-12-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-01-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-02-28 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-03-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-04-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-05-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-06-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-07-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-08-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-09-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-10-31 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-11-30 06:00:00
7  2023-01-01T06:00:00     Yearly 2023  2024-01-01T05:59:59     50 2023-12-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-01-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-02-29 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-03-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-04-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-05-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-06-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-07-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-08-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-09-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-10-31 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-11-30 06:00:00
8  2024-01-01T06:00:00     Yearly 2024  2025-01-01T05:59:59     60 2024-12-31 06:00:00
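
The exploded frame still has several rows for months covered by more than one period (April 2022 appears as both Monthly and Quarterly). One possible way to get to one row per month, as a sketch only: expand with month-start stamps, rank the periods, and keep the finest one per month. The `rank` mapping, the `date` and `rank` column names, and the use of `freq="MS"` are my assumptions, not something from the question, and only a subset of the question's rows is used here.

```python
import pandas as pd

# A subset of the question's rows: Monthly Jan-Apr, Quarterly Q2 and Q3.
rows = [
    ("Monthly 2022", "2022-01-01T06:00:00", "2022-02-01T05:59:59", 5),
    ("Monthly 2022", "2022-02-01T06:00:00", "2022-03-01T05:59:59", 7),
    ("Monthly 2022", "2022-03-01T06:00:00", "2022-04-01T05:59:59", 8),
    ("Monthly 2022", "2022-04-01T06:00:00", "2022-05-01T05:59:59", 9),
    ("Quarterly 2022", "2022-04-01T06:00:00", "2022-07-01T05:59:59", 10),
    ("Quarterly 2022", "2022-07-01T06:00:00", "2022-10-01T05:59:59", 11),
]
df = pd.DataFrame(
    rows, columns=["var name", "begin_delivery_date", "end_delivery_date", "value"]
)

# freq="MS" gives one month-start stamp per month covered by the row.
df["date"] = [
    pd.date_range(s, e, freq="MS")
    for s, e in zip(df["begin_delivery_date"], df["end_delivery_date"])
]
out = df.explode("date")

# Drop the 06:00 delivery offset so the dates sit on midnight month starts.
out["date"] = pd.to_datetime(out["date"]).dt.normalize()

# Assumed priority: the lower rank wins when several rows cover the same month.
rank = {"Monthly": 0, "Quarterly": 1, "Yearly": 2}
out["rank"] = out["var name"].str.split().str[0].map(rank)

result = (
    out.sort_values(["date", "rank"])
       .drop_duplicates("date", keep="first")
       .loc[:, ["date", "var name", "value"]]
       .reset_index(drop=True)
)
print(result)
```

Note that the full data in the question also has two Quarterly rows overlapping in September 2022; this sketch does not define which of two rows with the same rank should win, so you would need an extra tie-breaking rule for that case.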

Create the df and shuffle it (the data is the same data you posted above)

df = pd.DataFrame(data)
df = df.sample(frac=1).reset_index(drop=True)

Split each value in "var name" into two separate columns, var_name_pediod and var_name_year

df["var_name_pediod"] = df["var name"].str.split(" ").str[0]
df["var_name_year"] = df["var name"].str.split(" ").str[1]

Create a dictionary that defines the sort order of the periods, and use it to replace the values in the "var_name_pediod" column

sort_dic = {"Monthly":1,"Quarterly":2,"Yearly":3}
df["var_name_pediod"] = df["var_name_pediod"].replace(sort_dic)
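
A side note on the mapping step, as a sketch rather than a correction: `Series.replace` leaves any label that is not in the dictionary untouched, so the column can end up mixing strings and integers. `Series.map` is a possible alternative that turns unmapped labels into NaN, which makes them easier to spot. The "Weekly" label below is hypothetical and only there to show that behavior.

```python
import pandas as pd

# Hypothetical labels; "Weekly" is deliberately missing from the mapping.
periods = pd.Series(["Quarterly", "Monthly", "Yearly", "Weekly"])
sort_dic = {"Monthly": 1, "Quarterly": 2, "Yearly": 3}

# Labels without an entry in sort_dic become NaN instead of staying as strings.
ranks = periods.map(sort_dic)
print(ranks)
```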

Sort the values by the "var_name_pediod" column

df.sort_values(by=['var_name_pediod'], inplace=True)

Group by var_name_pediod and sort each group by "var_name_year"

# Assign the result back; otherwise the sorted frame is discarded.
df = df.groupby(['var_name_pediod']).apply(lambda x: x.sort_values(by=['var_name_year'])).reset_index(drop=True)

Done. Drop the extra columns if you don't need them

df.drop(columns=["var_name_pediod","var_name_year"],inplace=True)
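
For what it's worth, the split, rank, sort and drop steps above can probably be condensed into a single chain. This is only a sketch under my own assumptions: the `period_rank` and `year` helper names are made up for illustration, the frame is a tiny hypothetical sample, and every "var name" is assumed to follow the "<Period> <Year>" pattern.

```python
import pandas as pd

# Tiny hypothetical sample in shuffled order.
df = pd.DataFrame({
    "var name": ["Yearly 2024", "Quarterly 2022", "Monthly 2022", "Yearly 2023"],
    "value": [60, 10, 5, 50],
})
sort_dic = {"Monthly": 1, "Quarterly": 2, "Yearly": 3}

# Column 0 holds the period label, column 1 the year.
parts = df["var name"].str.split(" ", expand=True)

ordered = (
    df.assign(period_rank=parts[0].map(sort_dic), year=parts[1].astype(int))
      .sort_values(["period_rank", "year"])      # period first, then year
      .drop(columns=["period_rank", "year"])     # helper columns are not kept
      .reset_index(drop=True)
)
print(ordered)
```

Sorting on both keys at once avoids the groupby/apply step and leaves the original `df` untouched.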