Stack data from columns to rows in pandas dataframe
I am trying to stack financial values year-wise in a pandas DataFrame, but I cannot get started.
All I have tried is:
df1 = df.set_index(['refnum','y1gp','y2gp','y3gp']).stack()\
    .reset_index(name='REV').rename(columns={'level_5':'Year'})
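Existing:
refnum | y1 | y1rev | y1gp | y2 | y2rev | y2gp | y3 | y3rev | y3gp |
---|---|---|---|---|---|---|---|---|---|
10001 | 2021 | 300 | 200 | 2022 | 100 | 600 | 2023 | 300 | 300 |
10002 | 2020 | 300 | 200 | 2021 | 200 | 500 | 2022 | 300 | 300 |
10003 | 2021 | 300 | 200 | 2022 | 500 | 500 | 2023 | 300 | 300 |
Expected:
refnum | Year | REV | GP | BaseYear |
---|---|---|---|---|
10001 | 2021 | 300 | 200 | BaseYear |
10001 | 2022 | 100 | 600 | BaseYear+1 |
10001 | 2023 | 300 | 300 | BaseYear+2 |
10002 | 2020 | 300 | 200 | BaseYear |
10002 | 2021 | 200 | 500 | BaseYear+1 |
10002 | 2022 | 300 | 300 | BaseYear+2 |
10003 | 2021 | 300 | 200 | BaseYear |
10003 | 2022 | 500 | 500 | BaseYear+1 |
10003 | 2023 | 300 | 300 | BaseYear+2 |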
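To make the question reproducible, the input frame can be built directly from the values listed above. This is only a sketch of the sample data; the variable name `df` matches the one used in the question's attempt.

```python
import pandas as pd

# Sample data copied from the input table in the question.
df = pd.DataFrame(
    {
        "refnum": [10001, 10002, 10003],
        "y1": [2021, 2020, 2021],
        "y1rev": [300, 300, 300],
        "y1gp": [200, 200, 200],
        "y2": [2022, 2021, 2022],
        "y2rev": [100, 200, 500],
        "y2gp": [600, 500, 500],
        "y3": [2023, 2022, 2023],
        "y3rev": [300, 300, 300],
        "y3gp": [300, 300, 300],
    }
)

print(df.shape)  # (3, 10)
```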
Let's use str.replace and str.split to convert the headers into a usable MultiIndex, then stack to go from wide-form to long. Then use groupby cumcount to create the BaseYear column.
# Move refnum to the index so only the value columns are reshaped
df = df.set_index('refnum')
# Move the number to the end of each header (y1rev -> yrev/1), then split into a two-level MultiIndex
df.columns = (
    df.columns.str.replace(r'^(.*?)(\d+)(.*)$', r'\1\3/\2', regex=True)
        .str.split('/', expand=True)
)
# Wide Format to Long + Rename Columns
df = df.stack().droplevel(-1).reset_index().rename(
    columns={'y': 'Year', 'ygp': 'GP', 'yrev': 'REV'}
)
# Add Base Year Column
df['BaseYear'] = "BaseYear+" + df.groupby('refnum').cumcount().astype(str)
# df['BaseYear'] = df.groupby('refnum').cumcount()  # (int version)
df:
   refnum  Year   GP  REV    BaseYear
0   10001  2021  200  300  BaseYear+0
1   10001  2022  600  100  BaseYear+1
2   10001  2023  300  300  BaseYear+2
3   10002  2020  200  300  BaseYear+0
4   10002  2021  500  200  BaseYear+1
5   10002  2022  300  300  BaseYear+2
6   10003  2021  200  300  BaseYear+0
7   10003  2022  500  500  BaseYear+1
8   10003  2023  300  300  BaseYear+2
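The header rewrite is the step that makes the stack work, so it may help to look at it in isolation. The sketch below uses only the standard `re` module on the question's column labels; the replacement string `\1\3/\2` is an assumption about what the rewrite is meant to do (move the digits behind a `/` separator), chosen because it yields the `y`, `ygp` and `yrev` labels that the rename step expects.

```python
import re

# Column labels from the question (everything except the 'refnum' key).
columns = ["y1", "y1rev", "y1gp", "y2", "y2rev", "y2gp", "y3", "y3rev", "y3gp"]

# Group 1 = prefix before the digits, group 2 = the digits, group 3 = suffix.
pattern = re.compile(r"^(.*?)(\d+)(.*)$")

# Assumed replacement: glue prefix and suffix together, put the digits after '/'.
# Splitting on '/' then gives one (name, period) pair per column.
pairs = [tuple(pattern.sub(r"\1\3/\2", c).split("/")) for c in columns]

for original, pair in zip(columns, pairs):
    print(original, "->", pair)
```

Each pair corresponds to one entry of the two-level column index: the first element becomes the column that survives the stack, the second is the level that gets stacked into rows.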
Try:
import re
import pandas as pd
# Move the number to the end of each header: y1 -> -1, y1gp -> gp-1, y1rev -> rev-1
df.columns = [re.sub(r"y(\d+)(.*)", r"\2-\1", c) for c in df.columns]
x = (
    pd.wide_to_long(
        df, stubnames=["", "gp", "rev"], sep="-", i="refnum", j="Base Year"
    )
    .rename(columns={"": "year"})
    .reset_index()
    .sort_values(by="refnum")
)
print(x)
Prints:
   refnum  Base Year  year   gp  rev
0   10001          1  2021  200  300
3   10001          2  2022  600  100
6   10001          3  2023  300  300
1   10002          1  2020  200  300
4   10002          2  2021  500  200
7   10002          3  2022  300  300
2   10003          1  2021  200  300
5   10003          2  2022  500  500
8   10003          3  2023  300  300
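The output above numbers the periods 1 to 3, while the question asks for the labels BaseYear, BaseYear+1, BaseYear+2 and the column order refnum, Year, REV, GP. A minimal, self-contained sketch of one way to get there with `pd.wide_to_long` follows; the helper `relabel`, the `names` mapping and the intermediate `offset` column are hypothetical names introduced here for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "refnum": [10001, 10002, 10003],
        "y1": [2021, 2020, 2021], "y1rev": [300, 300, 300], "y1gp": [200, 200, 200],
        "y2": [2022, 2021, 2022], "y2rev": [100, 200, 500], "y2gp": [600, 500, 500],
        "y3": [2023, 2022, 2023], "y3rev": [300, 300, 300], "y3gp": [300, 300, 300],
    }
)

# Hypothetical mapping from the original suffixes to the requested column names.
names = {"": "Year", "rev": "REV", "gp": "GP"}


def relabel(col):
    """Turn 'y2rev' into 'REV-2'; labels that do not match (refnum) are kept."""
    m = re.fullmatch(r"y(\d+)(.*)", col)
    return f"{names[m.group(2)]}-{m.group(1)}" if m else col


long_df = (
    pd.wide_to_long(
        df.rename(columns=relabel),
        stubnames=["Year", "REV", "GP"],
        i="refnum",
        j="offset",
        sep="-",
    )
    .reset_index()
    .sort_values(["refnum", "offset"], ignore_index=True)
)

# Periods are numbered from 1, so period 1 is the base year itself.
long_df["BaseYear"] = [
    "BaseYear" if n == 1 else f"BaseYear+{n - 1}" for n in long_df["offset"]
]
long_df = long_df[["refnum", "Year", "REV", "GP", "BaseYear"]]
print(long_df)
```

Renaming the columns up front means no rename is needed after the reshape, and sorting by `refnum` and `offset` keeps each reference number's years together.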
You can use pivot_longer from pyjanitor; for this case, you pass a list of regular expressions to names_pattern and the matching new column names to names_to:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index='refnum',
                names_to=['year', 'REV', 'GP'],
                names_pattern=[r'^y\d$', r'.*rev$', r'.*gp$']
                )
   refnum  year  REV   GP
0   10001  2021  300  200
1   10002  2020  300  200
2   10003  2021  300  200
3   10001  2022  100  600
4   10002  2021  200  500
5   10003  2022  500  500
6   10001  2023  300  300
7   10002  2022  300  300
8   10003  2023  300  300
If you want to include the base year, you can modify the column labels that end with a number and then use pivot_longer:
(df.rename(columns = lambda col: f"{col}YEAR"
                                 if col.endswith(('1','2','3'))
                                 else col)
   .pivot_longer(index='refnum',
                 names_to= ("Base Year", ".value"),
                 names_pattern=r".(\d)(.+)",
                 sort_by_appearance=True)
)
   refnum Base Year  YEAR  rev   gp
0   10001         1  2021  300  200
1   10001         2  2022  100  600
2   10001         3  2023  300  300
3   10002         1  2020  300  200
4   10002         2  2021  200  500
5   10002         3  2022  300  300
6   10003         1  2021  300  200
7   10003         2  2022  500  500
8   10003         3  2023  300  300
The labels associated with .value remain as column headers, while the remaining labels are collected into a new column (Base Year).
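To see which part of each label ends up where, the split can be mimicked with the standard `re` module. This sketch does not call pyjanitor; it only applies the same regular expression to the renamed labels and groups the captures, under the assumption that the first group feeds the Base Year column and the second group supplies the headers.

```python
import re
from collections import defaultdict

# Column labels after the rename step: labels ending in a digit gain a 'YEAR' suffix.
labels = ["y1YEAR", "y1rev", "y1gp", "y2YEAR", "y2rev", "y2gp", "y3YEAR", "y3rev", "y3gp"]

# Same regex as names_pattern: skip one character, capture one digit, capture the rest.
pattern = re.compile(r".(\d)(.+)")

# Group 1 plays the role of "Base Year"; group 2 plays the role of ".value".
by_base_year = defaultdict(list)
for label in labels:
    base_year, value_name = pattern.fullmatch(label).groups()
    by_base_year[base_year].append(value_name)

print(dict(by_base_year))

# Without the rename, a bare 'y1' has nothing left for the second group to capture.
print(pattern.fullmatch("y1"))  # None
```

The second print shows why the rename comes first: the pattern requires at least one character after the digit, so the bare year columns would not match until they carry a suffix.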