Converting a weekly forecast (Pandas df) into monthly format

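I have a process that produces a dataframe containing a forecast for products (and versions) in a weekly format (wc/Monday dates, with the column names stored as strings). Example: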

product     version     2021-06-07     2021-06-14     2021-06-21     2021-06-28
   a           1           500            400            300            200
   a           2           750            600            450            200
   b           1           200            150            100            100
   b           2           500            400            300            200

I've been asked to change the forecast into a monthly forecast instead of a weekly one. Example:

product     version       Jun-21         Jul-21         Aug-21         Sep-21
   a           1           350             x              x              x
   a           2           500             x              x              x
   b           1           100             x              x              x
   b           2           350             x              x              x

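The numbers are for show. What I'm trying to do is average the weekly columns (for each row) to create monthly outputs, but in an accurate way: if a weekly column is wc/26th Feb, then only 3 days' worth of its value should be included in the Feb average and only 4 days' worth in the March average.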

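I know this is only a matter of formatting/bucketing, but I'm struggling to come up with a solution as I've never had to do something like this before.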

I'm not expecting a full solution, but a pointer in the right direction for how I should approach the task would be much appreciated.

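This problem can be solved by melting the DataFrame into long form (rather than wide form). In the example below, we convert to long form, group by year-month, take the mean, and then pivot back to wide form. The melt/pivot operations create a multi-index on the columns, so we also have to flatten that (last line of code). Note that this assigns each week entirely to the month of its start date; it does not split weeks that straddle a month boundary.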

import pandas as pd

df = pd.DataFrame({
    "product": ["a", "a", "b", "b"],
    "version": ["1", "2", "1", "2"],
    "2021-06-07": [500, 750, 200, 500],
    "2021-06-14": [400, 600, 150, 400],
    "2021-06-21": [300, 450, 100, 300],
    "2021-06-28": [200, 200, 100, 200],
    "2021-07-07": [500, 750, 200, 500],
    "2021-07-14": [400, 600, 150, 400],
    "2021-07-21": [300, 450, 100, 300],
    "2021-07-28": [200, 200, 100, 200],
})

# First, we melt into long-form data
df = df.melt(id_vars=['product', 'version'], var_name='date')

# Truncate the date string to its year-month part (e.g. "2021-06")
df['date'] = df['date'].str[:7]

# Group by product/version/date, then take the mean
df = df.groupby(['product', 'version', 'date']).mean()

# Pivot back to wide-form table
df = df.pivot_table(index=['product', 'version'], columns='date').reset_index()

# Reset column index from multi-index to single string
df.columns = [x[0] if not x[1] else x[1] for x in df.columns]
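
If the column labels should read Jun-21, Jul-21 as in the question's desired output, one option is to keep the month as a pandas Period while pivoting and format the labels only at the end. A minimal sketch, assuming a small made-up long-form frame (`long_df` and its values are illustrative, not the asker's data):

```python
import pandas as pd

# Hypothetical long-form data standing in for the melted frame
long_df = pd.DataFrame({
    "product": ["a", "a", "a", "a"],
    "version": ["1", "1", "1", "1"],
    "date": ["2021-06-07", "2021-06-28", "2021-07-07", "2021-07-28"],
    "value": [500, 200, 500, 200],
})

# Keep the month as a Period so the columns sort chronologically, not alphabetically
long_df["month"] = pd.to_datetime(long_df["date"]).dt.to_period("M")

monthly = long_df.pivot_table(
    index=["product", "version"], columns="month", values="value", aggfunc="mean"
)

# Relabel the Period columns as 'Jun-21', 'Jul-21', ... (month names follow the current locale)
monthly.columns = [p.strftime("%b-%y") for p in monthly.columns]
monthly = monthly.reset_index()
```

Formatting the labels before pivoting would also work, but string labels such as Apr-21 and Jun-21 sort alphabetically, which is why the sketch defers the formatting.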

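This is a bit of a process, because you need to count the days in the month, determine how many days of a week flow into the next month, do the math and shift that share forward. This should do the trick.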

import pandas as pd
import numpy as np

df = pd.DataFrame({'product': ['a', 'a', 'b', 'b'],
 'version': [1, 2, 1, 2],
 '6/7/2021': [500, 750, 200, 500],
 '6/14/2021': [400, 600, 150, 400],
 '6/21/2021': [300, 450, 100, 300],
 '6/28/2021': [200, 200, 100, 200],
 })

# Convert data to long format
df = df.melt(id_vars=['product','version'], var_name='date')
# Convert date to datetime object
df['date'] = pd.to_datetime(df['date'])

# Day of the month on which the week ends (a week covers its start date plus 6 more days)
df['month_day'] = df['date'].dt.day + 6

# Get the number of days in the month
df['days_in_month'] = df['date'].dt.daysinmonth

# Subtract to see how many days of the week fall into the next month
df['overrun'] = df['month_day'] - df['days_in_month']

# Calculate the share of the weekly value to push forward into the next month (overrun days out of 7)
df['push_forward'] = np.where(df['overrun'] > 0, df['value'] / 7 * df['overrun'], 0)

# Reduce the current values by the amount to be pushed forward
df['value'] = df['value'] - df['push_forward']

# Copy the records with a push_forward value to a new dataframe
df2 = df.loc[df['push_forward'] > 0].copy()

# Drop push_forward column
df.drop(columns='push_forward', inplace=True)

# Add a week to the date values of records with a push_forward value
df2['date'] = df2['date'] + pd.DateOffset(weeks=1)

# Merge the pushed data back to the original dataframe
df = df.merge(df2[['product','version','date','push_forward']], on=['product','version','date'], how='outer')

# Fill null values
df.fillna(0, inplace=True)

# Add the push forward values to their respective weekly values
df['value'] = df['value'] + df['push_forward']

# Convert date to just the month
df['date'] = df['date'].dt.strftime('%Y-%m')

# Group and take the average
df = df.groupby(['product','version','date'])['value'].mean().reset_index()

# Create final pivot table
df.pivot_table(index=['product','version'], columns='date', values='value')

Output

            date       2021-06     2021-07
product version     
      a        1    321.428571  114.285714
               2    471.428571  114.285714
      b        1    123.214286   57.142857
               2    321.428571  114.285714