Stack data from columns to rows in pandas dataframe

I am trying to stack financial values year by year in a pandas DataFrame, but I can't get started.

So far I have only tried:

df1 = df.set_index(['refnum','y1gp','y2gp','y3gp']).stack()\
.reset_index(name='REV').rename(columns={'level_5':'Year'})

Existing:

refnum y1 y1rev y1gp y2 y2rev y2gp y3 y3rev y3gp
10001 2021 300 200 2022 100 600 2023 300 300
10002 2020 300 200 2021 200 500 2022 300 300
10003 2021 300 200 2022 500 500 2023 300 300

Expected:

refnum Year REV GP BaseYear
10001 2021 300 200 BaseYear
10001 2022 100 600 BaseYear+1
10001 2023 300 300 BaseYear+2
10002 2020 300 200 BaseYear
10002 2021 200 500 BaseYear+1
10002 2022 300 300 BaseYear+2
10003 2021 300 200 BaseYear
10003 2022 500 500 BaseYear+1
10003 2023 300 300 BaseYear+2

Let's use str.replace and str.split to turn the headers into a usable MultiIndex, then stack to go from wide form to long. After that, groupby cumcount creates the BaseYear column.

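# Keep refnum as the index so it is not stacked with the other columns
df = df.set_index('refnum')
# Move the number to the end of each header ('y1rev' -> 'yrev/1'),
# then split on '/' to get a two-level MultiIndex
df.columns = (
    df.columns.str.replace(r'^(.*?)(\d+)(.*)$', r'\1\3/\2', regex=True)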
        .str.split('/', expand=True)
)
# Wide Format to Long + Rename Columns
df = df.stack().droplevel(-1).reset_index().rename(
    columns={'y': 'Year', 'ygp': 'GP', 'yrev': 'REV'}
)
# Add Base Year Column
df['BaseYear'] = "BaseYear+" + df.groupby('refnum').cumcount().astype(str)
# df['BaseYear'] = df.groupby('refnum').cumcount()  # (int version)

df:

   refnum  Year   GP  REV    BaseYear
0   10001  2021  200  300  BaseYear+0
1   10001  2022  600  100  BaseYear+1
2   10001  2023  300  300  BaseYear+2
3   10002  2020  200  300  BaseYear+0
4   10002  2021  500  200  BaseYear+1
5   10002  2022  300  300  BaseYear+2
6   10003  2021  200  300  BaseYear+0
7   10003  2022  500  500  BaseYear+1
8   10003  2023  300  300  BaseYear+2
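To make the header rewrite easier to follow, here is a minimal sketch that runs only the rename and split steps on the sample data from the question. The variable names `renamed` and `multi` are introduced here for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question, with refnum moved to the index
df = pd.DataFrame({
    "refnum": [10001, 10002, 10003],
    "y1": [2021, 2020, 2021], "y1rev": [300, 300, 300], "y1gp": [200, 200, 200],
    "y2": [2022, 2021, 2022], "y2rev": [100, 200, 500], "y2gp": [600, 500, 500],
    "y3": [2023, 2022, 2023], "y3rev": [300, 300, 300], "y3gp": [300, 300, 300],
}).set_index("refnum")

# Move the digits to the end, after a '/' separator: 'y1rev' -> 'yrev/1'
renamed = df.columns.str.replace(r"^(.*?)(\d+)(.*)$", r"\1\3/\2", regex=True)
print(list(renamed)[:3])  # ['y/1', 'yrev/1', 'ygp/1']

# Splitting on '/' yields a two-level MultiIndex of (name, number) pairs
multi = renamed.str.split("/", expand=True)
print(multi.nlevels)  # 2
```

Once the number sits in its own level, stack can move that level into the rows, which leaves one row per refnum and year slot.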

Try:

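import re
import pandas as pd

# Rewrite 'y1rev' -> 'rev-1' and 'y1' -> '-1' so wide_to_long can split stub and suffix
df.columns = [re.sub(r"y(\d+)(.*)", r"\2-\1", c) for c in df.columns]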
x = (
    pd.wide_to_long(
        df, stubnames=["", "gp", "rev"], sep="-", i="refnum", j="Base Year"
    )
    .rename(columns={"": "year"})
    .reset_index()
    .sort_values(by="refnum")
)
print(x)

Prints:

   refnum  Base Year  year   gp  rev
0   10001          1  2021  200  300
3   10001          2  2022  600  100
6   10001          3  2023  300  300
1   10002          1  2020  200  300
4   10002          2  2021  500  200
7   10002          3  2022  300  300
2   10003          1  2021  200  300
5   10003          2  2022  500  500
8   10003          3  2023  300  300

You can use pivot_longer from pyjanitor; for this case, you pass a list of regular expressions to names_pattern and the new column names to names_to:

# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index='refnum', 
                names_to=['year', 'REV', 'GP'], 
                names_pattern=[r'^y\d$', r'.*rev$', r'.*gp$']
               )

   refnum  year  REV   GP
0   10001  2021  300  200
1   10002  2020  300  200
2   10003  2021  300  200
3   10001  2022  100  600
4   10002  2021  200  500
5   10003  2022  500  500
6   10001  2023  300  300
7   10002  2022  300  300
8   10003  2023  300  300

If you want to include the base year, you can first rename the column labels that end in a digit, and then use pivot_longer:

(df.rename(columns = lambda col: f"{col}YEAR" 
                                 if col.endswith(('1','2','3')) 
                                 else col)
   .pivot_longer(index='refnum', 
                 names_to= ("Base Year", ".value"), 
                 names_pattern=r".(\d)(.+)", 
                 sort_by_appearance=True)
 )

   refnum Base Year  YEAR  rev   gp
0   10001         1  2021  300  200
1   10001         2  2022  100  600
2   10001         3  2023  300  300
3   10002         1  2020  300  200
4   10002         2  2021  200  500
5   10002         3  2022  300  300
6   10003         1  2021  300  200
7   10003         2  2022  500  500
8   10003         3  2023  300  300

The labels paired with .value stay as column headers, while the remaining labels are collected into a new column (Base Year).