Python Pandas Cross Tab By Multiple Categories and Years

I have a DataFrame with the following simplified structure:

import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'name': ["a", "b", "c", "d", "e"],
     'country': ["uk", "spain", "france", "germany", "italy"],
     'cat_01_2020': [10, 20, 30, 40, 50],
     'cat_01_2019': [11, 21, 31, 41, 51],
     'cat_01_2018': [12, 22, 32, 42, 52],
     'cat_02_2020': [100, 200, 300, 400, 500],
     'cat_02_2019': [111, 211, 311, 411, 511],
     'cat_02_2018': [122, 222, 322, 422, 522],
     'cat_03_2020': [1000, 2000, 3000, 4000, 5000],
     'cat_03_2019': [1111, 2111, 3111, 4111, 5111],
     'cat_03_2018': [1222, 2222, 3222, 4222, 5222]}

df = pd.DataFrame(data = d)

I would like to obtain this new DataFrame, df_target:

d_target = {'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
     'name': ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "e", "e", "e"],
     'country': ["uk", "uk", "uk", "spain", "spain", "spain", "france", "france", "france", "germany", "germany", "germany", "italy", "italy", "italy"],
     'year': [2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018, 2020, 2019, 2018],
     'cat_01': [10, 11, 12, 20, 21, 22, 30, 31, 32, 40, 41, 42, 50, 51, 52],
     'cat_02': [100, 111, 122, 200, 211, 222, 300, 311, 322, 400, 411, 422, 500, 511, 522],
     'cat_03': [1000, 1111, 1222, 2000, 2111, 2222, 3000, 3111, 3222, 4000, 4111, 4222, 5000, 5111, 5222]}

df_target = pd.DataFrame(data = d_target)

df_target

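To do this, I think I need to use the pandas crosstab function, first extracting the years 2018, 2019 and 2020, and then I should be able to get cat_01, cat_02 and cat_03.

Does anyone know how I can do this?

Thank you very much.

Best regards.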

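What you are probably looking for is pd.wide_to_long rather than crosstab, since this is a wide-to-long reshape and not a frequency table: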

pd.wide_to_long(df, 
                stubnames = ['cat_01', 'cat_02', 'cat_03'], 
                i = ['id', 'name', 'country'], 
                j = 'year', 
                sep = '_', 
                suffix = r"\d+").reset_index()
 
    id name  country  year  cat_01  cat_02  cat_03
0    1    a       uk  2020      10     100    1000
1    1    a       uk  2019      11     111    1111
2    1    a       uk  2018      12     122    1222
3    2    b    spain  2020      20     200    2000
4    2    b    spain  2019      21     211    2111
5    2    b    spain  2018      22     222    2222
6    3    c   france  2020      30     300    3000
7    3    c   france  2019      31     311    3111
8    3    c   france  2018      32     322    3222
9    4    d  germany  2020      40     400    4000
10   4    d  germany  2019      41     411    4111
11   4    d  germany  2018      42     422    4222
12   5    e    italy  2020      50     500    5000
13   5    e    italy  2019      51     511    5111
14   5    e    italy  2018      52     522    5222
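As a rough illustration of how the arguments fit together: wide_to_long expects each wide column to be named stub + sep + suffix, so 'cat_01_2020' is read as stub 'cat_01', separator '_', and suffix '2020', and the suffix ends up in the column named by j. The sketch below uses a reduced, hypothetical version of the question's frame (the names df_small and long_small are not from the original) and sorts the result so that it does not rely on any particular row order.

```python
import pandas as pd

# Reduced version of the question's frame: two rows, two categories, two years.
# (df_small is an illustrative name, not part of the original question.)
df_small = pd.DataFrame({
    'id': [1, 2],
    'name': ["a", "b"],
    'country': ["uk", "spain"],
    'cat_01_2020': [10, 20],
    'cat_01_2019': [11, 21],
    'cat_02_2020': [100, 200],
    'cat_02_2019': [111, 211],
})

# Each wide column is parsed as <stubname><sep><suffix>;
# the suffix (the year) is moved into the column named by `j`.
long_small = pd.wide_to_long(df_small,
                             stubnames=['cat_01', 'cat_02'],
                             i=['id', 'name', 'country'],
                             j='year',
                             sep='_',
                             suffix=r"\d+").reset_index()

# Sort explicitly so the layout does not depend on the row order
# that wide_to_long happens to return.
long_small = long_small.sort_values(['id', 'year']).reset_index(drop=True)

print(long_small.to_string(index=False))
```

Each id now has one row per year, with one column per category stub.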

Alternatively, you can use pivot_longer from pyjanitor:

#pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index = ['id', 'name',  'country'], 
                names_to = (".value", "year"), 
                names_pattern = r"(.+)_(\d+)$", 
                sort_by_appearance = True)

    id name  country  year  cat_01  cat_02  cat_03
0    1    a       uk  2020      10     100    1000
1    1    a       uk  2019      11     111    1111
2    1    a       uk  2018      12     122    1222
3    2    b    spain  2020      20     200    2000
4    2    b    spain  2019      21     211    2111
5    2    b    spain  2018      22     222    2222
6    3    c   france  2020      30     300    3000
7    3    c   france  2019      31     311    3111
8    3    c   france  2018      32     322    3222
9    4    d  germany  2020      40     400    4000
10   4    d  germany  2019      41     411    4111
11   4    d  germany  2018      42     422    4222
12   5    e    italy  2020      50     500    5000
13   5    e    italy  2019      51     511    5111
14   5    e    italy  2018      52     522    5222

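The .value in names_to keeps the part of the column name associated with it (here the cat_* prefix) as the column header(s), while the rest goes into the year column. The split is determined by the groups in the names_pattern argument.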
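To see what the two groups in names_pattern capture, the same regular expression can be tried on a few column names with Python's re module. This is only a sketch of the regex behaviour, not of pyjanitor's internals, and split_column is a hypothetical helper name.

```python
import re

# Same pattern as passed to names_pattern above.
pattern = re.compile(r"(.+)_(\d+)$")

def split_column(name):
    """Hypothetical helper: return (value_part, year_part) for one column name."""
    match = pattern.match(name)
    if match is None:
        raise ValueError(f"column name does not match the pattern: {name!r}")
    return match.groups()

for col in ['cat_01_2020', 'cat_02_2019', 'cat_03_2018']:
    value_part, year_part = split_column(col)
    # The first group pairs with ".value" (it becomes a column header),
    # the second group pairs with "year" (it becomes a cell value).
    print(col, '->', value_part, year_part)
```

Because (.+) is greedy and (\d+)$ is anchored at the end of the name, the split happens at the last underscore, so 'cat_01_2020' yields 'cat_01' and '2020' rather than 'cat' and '01_2020'.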