Stack data from columns to rows in pandas dataframe

Question

I am trying to stack financial values year by year in a pandas DataFrame, but I can't get started.

All I have tried is:

df1 = df.set_index(['refnum','y1gp','y2gp','y3gp']).stack()\
.reset_index(name='REV').rename(columns={'level_5':'Year'})

Existing:

refnum	y1	y1rev	y1gp	y2	y2rev	y2gp	y3	y3rev	y3gp
10001	2021	300	200	2022	100	600	2023	300	300
10002	2020	300	200	2021	200	500	2022	300	300
10003	2021	300	200	2022	500	500	2023	300	300

Expected:

refnum	Year	REV	GP	BaseYear
10001	2021	300	200	BaseYear
10001	2022	100	600	BaseYear+1
10001	2023	300	300	BaseYear+2
10002	2020	300	200	BaseYear
10002	2021	200	500	BaseYear+1
10002	2022	300	300	BaseYear+2
10003	2021	300	200	BaseYear
10003	2022	500	500	BaseYear+1
10003	2023	300	300	BaseYear+2

Answer 1

Let's use str.replace and str.split to turn the headers into a usable MultiIndex, then stack to go from wide form to long form. After that, groupby cumcount creates the BaseYear column.

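# Keep refnum as the index so that only the year columns are reshaped
df = df.set_index('refnum')
# Move the digits to the end of each label, then split into a two-level MultiIndex,
# e.g. 'y1rev' -> 'yrev/1' -> ('yrev', '1')
df.columns = (
    df.columns.str.replace(r'^(.*?)(\d+)(.*)$', r'\1\3/\2', regex=True)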
        .str.split('/', expand=True)
)
# Wide Format to Long + Rename Columns
df = df.stack().droplevel(-1).reset_index().rename(
    columns={'y': 'Year', 'ygp': 'GP', 'yrev': 'REV'}
)
# Add Base Year Column
df['BaseYear'] = "BaseYear+" + df.groupby('refnum').cumcount().astype(str)
# df['BaseYear'] = df.groupby('refnum').cumcount()  # (int version)

df:

   refnum  Year   GP  REV    BaseYear
0   10001  2021  200  300  BaseYear+0
1   10001  2022  600  100  BaseYear+1
2   10001  2023  300  300  BaseYear+2
3   10002  2020  200  300  BaseYear+0
4   10002  2021  500  200  BaseYear+1
5   10002  2022  300  300  BaseYear+2
6   10003  2021  200  300  BaseYear+0
7   10003  2022  500  500  BaseYear+1
8   10003  2023  300  300  BaseYear+2
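
The header renaming step can be inspected on its own. The following is a minimal sketch that applies the same str.replace and str.split calls to the question's column labels (refnum is left out, since it is moved to the index first). The names cols, renamed and multi are illustrative only and do not appear in the answer.

```python
import pandas as pd

# Column labels from the question, without refnum
cols = pd.Index(['y1', 'y1rev', 'y1gp', 'y2', 'y2rev', 'y2gp', 'y3', 'y3rev', 'y3gp'])

# Move the digits behind a '/' separator: 'y1rev' -> 'yrev/1'
renamed = cols.str.replace(r'^(.*?)(\d+)(.*)$', r'\1\3/\2', regex=True)

# Split on '/' so that each label becomes a (name, number) pair
multi = renamed.str.split('/', expand=True)

print(list(renamed))
print(list(multi))
```

Once the labels are pairs, stack() can move the number level into the rows while the name level ('y', 'yrev', 'ygp') stays as columns.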

Answer 2

Try:

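import re
import pandas as pd

# Move the digit behind a '-' separator: 'y1rev' -> 'rev-1', 'y1' -> '-1'
df.columns = [re.sub(r"y(\d+)(.*)", r"\2-\1", c) for c in df.columns]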
x = (
    pd.wide_to_long(
        df, stubnames=["", "gp", "rev"], sep="-", i="refnum", j="Base Year"
    )
    .rename(columns={"": "year"})
    .reset_index()
    .sort_values(by="refnum")
)
print(x)

Prints:

   refnum  Base Year  year   gp  rev
0   10001          1  2021  200  300
3   10001          2  2022  600  100
6   10001          3  2023  300  300
1   10002          1  2020  200  300
4   10002          2  2021  500  200
7   10002          3  2022  300  300
2   10003          1  2021  200  300
5   10003          2  2022  500  500
8   10003          3  2023  300  300
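
To see why the stubnames are "", "gp" and "rev", it may help to look at the renamed labels by themselves. This is a small sketch using only the standard library, assuming the replacement moves the suffix in front of the digit with "-" as the separator; cols and renamed are illustrative names.

```python
import re

# Column labels from the question
cols = ["refnum", "y1", "y1rev", "y1gp", "y2", "y2rev", "y2gp", "y3", "y3rev", "y3gp"]

# 'y1' has no suffix, so it becomes '-1' (an empty stub followed by the separator)
renamed = [re.sub(r"y(\d+)(.*)", r"\2-\1", c) for c in cols]
print(renamed)
```

Each renamed label now has the form stub + "-" + number, which is the layout pd.wide_to_long expects for the given sep; refnum contains no "y" followed by a digit, so it is left unchanged.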

Answer 3

You can use pivot_longer from pyjanitor; for this case, pass a list of regular expressions to names_pattern and the new column names to names_to:

# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index='refnum',
                names_to=['year', 'REV', 'GP'],
                # raw strings, so that \d is not treated as a string escape
                names_pattern=[r'^y\d$', r'.*rev$', r'.*gp$']
               )

   refnum  year  REV   GP
0   10001  2021  300  200
1   10002  2020  300  200
2   10003  2021  300  200
3   10001  2022  100  600
4   10002  2021  200  500
5   10003  2022  500  500
6   10001  2023  300  300
7   10002  2022  300  300
8   10003  2023  300  300

If you want to include the base year, you can modify the column labels that end in a digit and then use pivot_longer:

(df.rename(columns = lambda col: f"{col}YEAR" 
                                 if col.endswith(('1','2','3')) 
                                 else col)
   .pivot_longer(index='refnum', 
                 names_to= ("Base Year", ".value"), 
                 names_pattern=r".(\d)(.+)", 
                 sort_by_appearance=True)
 )

   refnum Base Year  YEAR  rev   gp
0   10001         1  2021  300  200
1   10001         2  2022  100  600
2   10001         3  2023  300  300
3   10002         1  2020  300  200
4   10002         2  2021  200  500
5   10002         3  2022  300  300
6   10003         1  2021  300  200
7   10003         2  2022  500  500
8   10003         3  2023  300  300

The labels associated with .value stay as column headers, while the remaining labels are collected into a new column (Base Year).
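
The same idea can be sketched in plain pandas, which may make the role of .value easier to see. This is not how pyjanitor implements pivot_longer; it is only an illustration. The frame is rebuilt from the question's data, and the names wide, pairs and long are made up for this example.

```python
import re
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'refnum': [10001, 10002, 10003],
    'y1': [2021, 2020, 2021], 'y1rev': [300, 300, 300], 'y1gp': [200, 200, 200],
    'y2': [2022, 2021, 2022], 'y2rev': [100, 200, 500], 'y2gp': [600, 500, 500],
    'y3': [2023, 2022, 2023], 'y3rev': [300, 300, 300], 'y3gp': [300, 300, 300],
})

wide = df.set_index('refnum')

# Split each label into (digit, measure); a bare 'y1' gets the measure name 'YEAR'
pairs = [re.match(r'^y(\d)(.*)$', c).groups() for c in wide.columns]
wide.columns = pd.MultiIndex.from_tuples(
    [(num, name or 'YEAR') for num, name in pairs],
    names=['Base Year', None],
)

# The digit level moves into the rows; the measure level stays as column headers
long = wide.stack(level='Base Year').reset_index()
print(long)
```

Here the measure level plays the part of .value, and the digit level becomes the Base Year column. The order of the measure columns can differ between pandas versions.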

python

python-3.x

pandas

dataframe

stack