Data manipulation in pandas on monthly, quarterly and annual level on multiple columns

I need to create a function that takes a dictionary as input and updates column values in a DataFrame. My data looks like this:

Date Col_1 Col_2 Col_3 Col_4 Col_5
01/01/2021 10 20 10 20 10
02/01/2021 10 20 10 20 10
03/01/2021 10 20 10 20 10
04/01/2021 10 20 10 20 10
05/01/2021 10 20 10 20 10
06/01/2021 10 20 10 20 10
07/01/2021 10 20 10 20 10
08/01/2021 10 20 10 20 10
09/01/2021 10 20 10 20 10
10/01/2021 10 20 10 20 10
11/01/2021 10 20 10 20 10
12/01/2021 10 20 10 20 10

Now suppose percentage updates are passed at the monthly level for 'Col_1' and 'Col_2', for example:

{'Date': ['01/01/2021','02/01/2021','03/01/2021','04/01/2021','05/01/2021','06/01/2021',
          '07/01/2021','08/01/2021','09/01/2021','10/01/2021','11/01/2021','12/01/2021'],
 'Col_1': [20,20,20,20,30,30,40,40,20,20,20,20],
 'Col_2': [0,0,0,0,0,0,0,0,0,0,10,10]}

After this operation, I want the monthly changes to look like this:

Date Col_1 Col_2 Col_3 Col_4 Col_5
01/01/2021 12 20 10 20 10
02/01/2021 12 20 10 20 10
03/01/2021 12 20 10 20 10
04/01/2021 12 20 10 20 10
05/01/2021 13 20 10 20 10
06/01/2021 13 20 10 20 10
07/01/2021 14 20 10 20 10
08/01/2021 14 20 10 20 10
09/01/2021 12 20 10 20 10
10/01/2021 12 20 10 20 10
11/01/2021 12 22 10 20 10
12/01/2021 12 22 10 20 10

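Similarly, I also want to update the data at the quarterly and annual level. I was able to do the annual update; my code is below. Please help me do the monthly and quarterly updates based on the input.

Thanks!!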

dic = {'col_1':10,'col_2':-5}
year = 2021
def update_df(dic,df,year):
    df = df[df['date'].dt.year == year]
    df = (df+df.select_dtypes(include = 'number').mul(pd.Series(dic)/100)).combine_first(df)[df.columns]
    return df

I am trying something like this:

def update_df(dic, df, year, choice):
    if choice == 'annual':
        df = df[df['date'].dt.year == year]
        df = (df + df.select_dtypes(include='number').mul(pd.Series(dic) / 100)).combine_first(df)[df.columns]
    elif choice == 'quarterly':
        df["quarter"] = df.date.dt.quarter
        df = (df + df.select_dtypes(include='number').mul(pd.Series(dic) / 100)).combine_first(df)[df.columns]
    elif choice == 'monthly':
        df["month"] = df.date.dt.month
        df = (df + df.select_dtypes(include='number').mul(pd.Series(dic) / 100)).combine_first(df)[df.columns]
    return df

There may well be a more concise approach, but the following works and provides a single function for annual, quarterly, or monthly updates:

import pandas as pd
from collections import namedtuple

# Control tuple defining the date parameters for changing dataframe
DateControl = namedtuple('DateControl', ['Year', 'Quarter', 'Month'])


def updateFrame(df:pd.DataFrame, pcnt_val: float, **args) -> pd.DataFrame:
    # Update the rows matching a specified year, quarter, or month by pcnt_val
    dtecol = args.pop('DTECOL', None)
    colList = args.pop('Columns', [])
    control = DateControl(args.pop('Year', None),
                          args.pop('Quarter', None),
                          args.pop('Month', None)
                         )
    
    def EvalDate(ds: pd.Series, row: int, selection: DateControl) -> bool:
        # Evaluate the truth of a date based on control arguments
        yr = False
        qtr = False
        mnth = False
        if selection.Year is None:
            yr = True
        else:
            if ds[row].year == selection.Year:
                yr = True
        if selection.Quarter is None:
            qtr = True
        else:
            if ds[row].quarter == selection.Quarter:
                qtr = True
        if selection.Month is None:
            mnth = True
        else:
            if ds[row].month == selection.Month:
                mnth = True
        return yr and qtr and mnth
    
    # Build a boolean mask of matching rows, then scale every column in colList.
    # Note: df is modified in place and also returned.
    mask = [EvalDate(df[dtecol], x, control) for x in range(len(df))]
    mod = [(1.0 + pcnt_val) if m else 1.0 for m in mask]
    for c in colList:
        df[c] = df[c] * mod
    return df

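The updateFrame function takes two positional arguments:

  • df - the DataFrame to update
  • pcnt_val - the change to apply to the current values, expressed as a fraction (0.25 means +25%)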

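The function also accepts the following keyword arguments:

  • DTECOL - the name of the df column that contains the dates
  • Columns - a list of the column headers in df to change
  • Year - the year value, or None to change all years
  • Quarter - a specific quarter as an integer from 1 to 4 (inclusive), or None
  • Month - the specific month to change, or None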

Apply this function to your DataFrame df as follows:

dg = updateFrame(df, .25, DTECOL='Date', Columns=['Col_1', 'Col_2'], Year=2021, Quarter=3)  

This yields:

         Date  Col_1  Col_2  Col_3  Col_4  Col_5
0  2021-01-01   10.0   20.0     10     20     10
1  2021-02-01   10.0   20.0     10     20     10
2  2021-03-01   10.0   20.0     10     20     10
3  2021-04-01   10.0   20.0     10     20     10
4  2021-05-01   10.0   20.0     10     20     10
5  2021-06-01   10.0   20.0     10     20     10
6  2021-07-01   12.5   25.0     10     20     10
7  2021-08-01   12.5   25.0     10     20     10
8  2021-09-01   12.5   25.0     10     20     10
9  2021-10-01   10.0   20.0     10     20     10
10 2021-11-01   10.0   20.0     10     20     10
11 2021-12-01   10.0   20.0     10     20     10
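As an aside, the same year/quarter/month selection can probably be expressed without a row-by-row loop by combining boolean masks built from the `.dt` accessor. The sketch below is only an illustration: the name `update_frame_vectorized` and its keyword parameters are made up for this example, and it assumes the date column already has a datetime dtype. Unlike updateFrame, it returns a modified copy and leaves the input untouched.

```python
import pandas as pd


def update_frame_vectorized(df, pcnt_val, date_col, columns,
                            year=None, quarter=None, month=None):
    """Scale `columns` by (1 + pcnt_val) on rows matching year/quarter/month.

    Hypothetical helper for illustration; a filter left as None matches every row.
    """
    dates = df[date_col]
    mask = pd.Series(True, index=df.index)
    if year is not None:
        mask &= dates.dt.year == year
    if quarter is not None:
        mask &= dates.dt.quarter == quarter
    if month is not None:
        mask &= dates.dt.month == month

    # Cast the target columns to float on a copy so fractional results fit
    out = df.astype({c: "float64" for c in columns})
    out.loc[mask, columns] = out.loc[mask, columns] * (1.0 + pcnt_val)
    return out


# Sample frame shaped like the one in the question (assumed month-start dates)
df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=12, freq="MS"),
    "Col_1": 10, "Col_2": 20, "Col_3": 10,
})
result = update_frame_vectorized(df, 0.25, "Date", ["Col_1", "Col_2"],
                                 year=2021, quarter=3)
print(result)
```

With the sample frame, only the July to September rows of Col_1 and Col_2 are scaled, which matches the table above.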

Given that you want to supply the updates for all four quarters in a single call, I would do the following. Add a new function:

def updateByQuarter(df:pd.DataFrame, changes: list, **args) -> pd.DataFrame:
    #  Given a quarterly change list of the form tuple(qtrid, chgval) Update the dataframe
    for chg in changes:
        args['Quarter'] = chg[0]
        df = updateFrame(df, chg[1], **args)
    return df    

Then create the list of changes by quarter:

# List of tuples defining the quarter and percent change
qtrChg = [(1, 0.02), (2, 0.04), (3, -0.02), (4, 0.15)]

Usage:

df = updateByQuarter(df, qtrChg, DTECOL='Date', Columns=['Col_1', 'Col_2'])

This produces:

         Date  Col_1  Col_2  Col_3  Col_4  Col_5
0  2021-01-01   10.2   20.4     10     20     10
1  2021-02-01   10.2   20.4     10     20     10
2  2021-03-01   10.2   20.4     10     20     10
3  2021-04-01   10.4   20.8     10     20     10
4  2021-05-01   10.4   20.8     10     20     10
5  2021-06-01   10.4   20.8     10     20     10
6  2021-07-01    9.8   19.6     10     20     10
7  2021-08-01    9.8   19.6     10     20     10
8  2021-09-01    9.8   19.6     10     20     10
9  2021-10-01   11.5   23.0     10     20     10
10 2021-11-01   11.5   23.0     10     20     10
11 2021-12-01   11.5   23.0     10     20     10
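For the monthly case in the question, where the input is a dictionary with a list of dates and one list of percentages per column, one possible approach is to align the percentages to the frame by date and multiply. This is a rough sketch rather than a definitive implementation: `update_by_date` and its `date_format` parameter are hypothetical names, it assumes the dictionary dates are month/day/year strings, and it treats any date missing from the dictionary as a 0% change.

```python
import pandas as pd


def update_by_date(df, changes, date_col="Date", date_format="%m/%d/%Y"):
    """Apply per-date percentage changes given as a dict of equal-length lists.

    Hypothetical helper: `changes` holds the dates under `date_col` and one
    list of percentages (20 means +20%) for each column to update.
    """
    pct = pd.DataFrame(changes)
    pct[date_col] = pd.to_datetime(pct[date_col], format=date_format)
    pct = pct.set_index(date_col)
    cols = list(pct.columns)

    # Line the percentages up with the frame's rows; unlisted dates get 0%
    aligned = pct.reindex(df[date_col]).fillna(0.0)
    factors = 1.0 + aligned.to_numpy(dtype=float) / 100.0

    out = df.astype({c: "float64" for c in cols})
    out[cols] = out[cols].to_numpy() * factors
    return out


dates = pd.date_range("2021-01-01", periods=12, freq="MS")
df = pd.DataFrame({"Date": dates, "Col_1": 10, "Col_2": 20, "Col_3": 10})
changes = {
    "Date": dates.strftime("%m/%d/%Y").tolist(),
    "Col_1": [20, 20, 20, 20, 30, 30, 40, 40, 20, 20, 20, 20],
    "Col_2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 10],
}
updated = update_by_date(df, changes)
print(updated.round(2))
```

With this input, Col_1 goes from 10 to 12, 13, or 14 depending on the month, and Col_2 goes from 20 to 22 in November and December, since a 10% increase on 20 is 2. The same function should also cover quarterly input if every month in a quarter is given the same percentage.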

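pandas captures 4 general time-related concepts:

Date times: a specific date and time with timezone support. Similar to datetime.datetime from the standard library.

Time deltas: an absolute time duration. Similar to datetime.timedelta from the standard library.

Time spans: a span of time defined by a point in time and its associated frequency.

Date offsets: a relative time duration that respects calendar arithmetic. Similar to dateutil.relativedelta.relativedelta from the dateutil package.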
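To make the four concepts concrete, here is a small sketch using the corresponding pandas scalar types (`Timestamp`, `Timedelta`, `Period`, and `DateOffset`). The specific dates are arbitrary and were chosen only to line up with the third quarter used earlier.

```python
import pandas as pd

ts = pd.Timestamp("2021-07-01")        # date time: a specific point in time
delta = pd.Timedelta(days=30)          # time delta: an absolute duration
span = pd.Period("2021Q3", freq="Q")   # time span: a point in time plus a frequency
offset = pd.DateOffset(months=1)       # date offset: a calendar-aware relative duration

print(ts + delta)    # 30 days later: 2021-07-31 00:00:00
print(ts + offset)   # one calendar month later: 2021-08-01 00:00:00
print(span.start_time, span.end_time)  # first and last instant of the quarter
print(ts.to_period("Q") == span)       # True: the timestamp falls in 2021Q3
```

The difference between the second and fourth types is the relevant one for month-level updates: adding 30 days lands on July 31, while adding one month lands on August 1.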