Python Pandas Cross Tab By Multiple Categories and Years
I have a dataframe with the following simplified structure:
import pandas as pd
d = {'id': [1, 2, 3, 4, 5],
'name': ["a", "b", "c", "d", "e"],
'country': ["uk", "spain", "france", "germany", "italy"],
'cat_01_2020': [10, 20, 30, 40, 50],
'cat_01_2019': [11, 21, 31, 41, 51],
'cat_01_2018': [12, 22, 32, 42, 52],
'cat_02_2020': [100, 200, 300, 400, 500],
'cat_02_2019': [111, 211, 311, 411, 511],
'cat_02_2018': [122, 222, 322, 422, 522],
'cat_03_2020': [1000, 2000, 3000, 4000, 5000],
'cat_03_2019': [1111, 2111, 3111, 4111, 5111],
'cat_03_2018': [1222, 2222, 3222, 4222, 5222]}
df = pd.DataFrame(data = d)
I would like to obtain this new df_target:
d_target = {'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
'name': ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "e", "e", "e"],
'country': ["uk", "uk", "uk", "spain", "spain", "spain", "france", "france", "france", "germany", "germany", "germany", "italy", "italy", "italy"],
'year': [2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018],
'cat_01': [10, 11, 12, 20, 21, 22, 30, 31, 32, 40, 41, 42, 50, 51, 52],
'cat_02': [100, 111, 122, 200, 211, 222, 300, 311, 322, 400, 411, 422, 500, 511, 522],
'cat_03': [1000, 1111, 1222, 2000, 2111, 2222, 3000, 3111, 3222, 4000, 4111, 4222, 5000, 5111, 5222]}
df_target = pd.DataFrame(data = d_target)
df_target
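To do this, I think I need to use the pandas crosstab function, first to get the years 2018, 2019 and 2020. After that I should be able to get cat_01, cat_02 and cat_03.
Does anyone know how I could do this?
Thank you very much.
Best regards.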
What you are looking for is probably wide_to_long:
pd.wide_to_long(df,
stubnames = ['cat_01', 'cat_02', 'cat_03'],
i = ['id', 'name', 'country'],
j = 'year',
sep = '_',
suffix = r"\d+").reset_index()
id name country year cat_01 cat_02 cat_03
0 1 a uk 2020 10 100 1000
1 1 a uk 2019 11 111 1111
2 1 a uk 2018 12 122 1222
3 2 b spain 2020 20 200 2000
4 2 b spain 2019 21 211 2111
5 2 b spain 2018 22 222 2222
6 3 c france 2020 30 300 3000
7 3 c france 2019 31 311 3111
8 3 c france 2018 32 322 3222
9 4 d germany 2020 40 400 4000
10 4 d germany 2019 41 411 4111
11 4 d germany 2018 42 422 4222
12 5 e italy 2020 50 500 5000
13 5 e italy 2019 51 511 5111
14 5 e italy 2018 52 522 5222
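As a cross-check, here is a minimal self-contained sketch of the same wide_to_long call with an explicit sort added. The sort is my own addition, not part of the answer above: the row order that wide_to_long returns may not match the target frame on every pandas version, so sorting by id (ascending) and year (descending) pins the layout down. The name long_df is only illustrative.

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'name': ["a", "b", "c", "d", "e"],
     'country': ["uk", "spain", "france", "germany", "italy"],
     'cat_01_2020': [10, 20, 30, 40, 50],
     'cat_01_2019': [11, 21, 31, 41, 51],
     'cat_01_2018': [12, 22, 32, 42, 52],
     'cat_02_2020': [100, 200, 300, 400, 500],
     'cat_02_2019': [111, 211, 311, 411, 511],
     'cat_02_2018': [122, 222, 322, 422, 522],
     'cat_03_2020': [1000, 2000, 3000, 4000, 5000],
     'cat_03_2019': [1111, 2111, 3111, 4111, 5111],
     'cat_03_2018': [1222, 2222, 3222, 4222, 5222]}
df = pd.DataFrame(data=d)

long_df = (
    pd.wide_to_long(df,
                    stubnames=['cat_01', 'cat_02', 'cat_03'],
                    i=['id', 'name', 'country'],
                    j='year',
                    sep='_',
                    suffix=r"\d+")
    .reset_index()
    # Assumption: we want the id-major, newest-year-first layout of df_target.
    .sort_values(['id', 'year'], ascending=[True, False])
    .reset_index(drop=True)
)
print(long_df)
```

If the row order does not matter for your use case, the two calls after reset_index() can simply be dropped.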
Alternatively, you can use pivot_longer from pyjanitor:
#pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index = ['id', 'name', 'country'],
names_to = (".value", "year"),
names_pattern = r"(.+)_(\d+)$",
sort_by_appearance = True)
id name country year cat_01 cat_02 cat_03
0 1 a uk 2020 10 100 1000
1 1 a uk 2019 11 111 1111
2 1 a uk 2018 12 122 1222
3 2 b spain 2020 20 200 2000
4 2 b spain 2019 21 211 2111
5 2 b spain 2018 22 222 2222
6 3 c france 2020 30 300 3000
7 3 c france 2019 31 311 3111
8 3 c france 2018 32 322 3222
9 4 d germany 2020 40 400 4000
10 4 d germany 2019 41 411 4111
11 4 d germany 2018 42 422 4222
12 5 e italy 2020 50 500 5000
13 5 e italy 2019 51 511 5111
14 5 e italy 2018 52 522 5222
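The .value in names_to keeps whichever part of the column name is associated with it (in this case the cat_01, cat_02 and cat_03 prefixes) as the column header(s), while the rest goes into the year column. The split is determined by the groups in the names_pattern argument.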
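To make the role of the two regex groups more concrete, the same split can be sketched with plain pandas: melt the wide columns, apply the same pattern with str.extract, then pivot the first group back into columns. This only illustrates the idea and is not how pyjanitor implements pivot_longer; the names id_cols, melted, parts and result are my own, and passing a list as the pivot index assumes pandas 1.1 or later.

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'name': ["a", "b", "c", "d", "e"],
     'country': ["uk", "spain", "france", "germany", "italy"],
     'cat_01_2020': [10, 20, 30, 40, 50],
     'cat_01_2019': [11, 21, 31, 41, 51],
     'cat_01_2018': [12, 22, 32, 42, 52],
     'cat_02_2020': [100, 200, 300, 400, 500],
     'cat_02_2019': [111, 211, 311, 411, 511],
     'cat_02_2018': [122, 222, 322, 422, 522],
     'cat_03_2020': [1000, 2000, 3000, 4000, 5000],
     'cat_03_2019': [1111, 2111, 3111, 4111, 5111],
     'cat_03_2018': [1222, 2222, 3222, 4222, 5222]}
df = pd.DataFrame(data=d)

id_cols = ['id', 'name', 'country']

# Step 1: stack every cat_XX_YYYY column into (column, value) pairs.
melted = df.melt(id_vars=id_cols, var_name='column', value_name='value')

# Step 2: same pattern as names_pattern; group 0 is the part that ".value"
# keeps as a header, group 1 is the part that becomes the year.
parts = melted['column'].str.extract(r"(.+)_(\d+)$")
melted['stub'] = parts[0]
melted['year'] = parts[1].astype(int)

# Step 3: spread the stubs back out as columns, one row per id and year.
result = (
    melted.pivot(index=id_cols + ['year'], columns='stub', values='value')
    .reset_index()
    .rename_axis(columns=None)
)
print(result)
```

Note that pivot sorts by its index, so within each id the years come out in ascending order here rather than 2020 first.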