Melt/unpivot a dataset with multiple groups of values

Question

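I am trying to transform a DataFrame in Python, but I am stuck because I don't know how to phrase exactly what I want to do (which makes searching difficult). It seems I need a combination of unstack and pivot. Hopefully an example explains it. I have a DataFrame of this shape: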

userid	GroupA_measure1	GroupA_measure2	GroupB_measure1	GroupB_measure2
001	65	70	45	50
002	96	89	12	8
003	12	14	38	40

I would like to transform it into this format:

userid	measure	groupA	groupB
001	1	65	45
001	2	70	50
002	1	96	12
002	2	89	8
003	1	12	38
003	2	14	40

I can use pd.melt(df, id_vars=['userid']) to unpivot the whole df, which puts every value in its own row, but I want to keep separate columns for the GroupA and GroupB values.
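
For reference, a minimal sketch that rebuilds the sample frame and shows what the plain melt produces. It assumes userid is stored as strings (to keep the leading zeros); the name `melted` is only illustrative.

```python
import pandas as pd

# Sample data from the question; userid is assumed to be a string column.
df = pd.DataFrame({
    "userid": ["001", "002", "003"],
    "GroupA_measure1": [65, 96, 12],
    "GroupA_measure2": [70, 89, 14],
    "GroupB_measure1": [45, 12, 38],
    "GroupB_measure2": [50, 8, 40],
})

# Plain melt: every value lands in one "value" column,
# so GroupA and GroupB are no longer separate columns.
melted = pd.melt(df, id_vars=["userid"])
print(melted.shape)             # (12, 3)
print(melted.columns.tolist())  # ['userid', 'variable', 'value']
```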

Any help would be appreciated.

Answer 1

Use wide_to_long, then extract the numbers from the measure column with Series.str.extract:

df1 = pd.wide_to_long(df, 
                      stubnames=['GroupA','GroupB'], 
                      i='userid', 
                      j='measure', sep='_', suffix=r'\w+').reset_index()

# raw string avoids an invalid-escape warning; expand=False returns a Series
df1['measure'] = df1['measure'].str.extract(r'(\d+)', expand=False).astype(int)
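
As a variant (an assumption on my part, not something the answer states), passing sep='_measure' together with a numeric suffix should let wide_to_long strip the whole "_measure" part itself, so the separate str.extract step is not needed. A minimal self-contained sketch on the sample data, with `long_df` as an illustrative name:

```python
import pandas as pd

# Sample data from the question; userid is assumed to be a string column.
df = pd.DataFrame({
    "userid": ["001", "002", "003"],
    "GroupA_measure1": [65, 96, 12],
    "GroupA_measure2": [70, 89, 14],
    "GroupB_measure1": [45, 12, 38],
    "GroupB_measure2": [50, 8, 40],
})

long_df = (
    pd.wide_to_long(
        df,
        stubnames=["GroupA", "GroupB"],
        i="userid",
        j="measure",
        sep="_measure",   # everything between the stub and the number
        suffix=r"\d+",    # numeric suffix, converted to a number by pandas
    )
    .reset_index()
    .sort_values(["userid", "measure"], ignore_index=True)
)
print(long_df)
```

The trade-off is that the separator is hard-coded to this naming scheme; the more general suffix=r'\w+' plus str.extract also copes with suffixes that are not all called "measure".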

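Alternatively, first move the columns without _ into the index, split the remaining column names on _ and reshape with DataFrame.stack, then extract the numbers as before: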

df1 = df.set_index('userid')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.rename_axis((None, 'measure'), axis=1).stack().reset_index()
df1['measure'] = df1['measure'].str.extract(r'(\d+)', expand=False).astype(int)
print(df1)
  userid  measure GroupA GroupB
0    001        1     65     45
1    002        1     96     12
2    003        1     12     38
3    001        2     70     50
4    002        2     89      8
5    003        2     14     40
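
To make the intermediate step of the stack approach visible, here is a minimal self-contained sketch on the sample data. The names `wide` and `stacked` are illustrative, and userid is assumed to be a string column.

```python
import pandas as pd

# Sample data from the question; userid is assumed to be a string column.
df = pd.DataFrame({
    "userid": ["001", "002", "003"],
    "GroupA_measure1": [65, 96, 12],
    "GroupA_measure2": [70, 89, 14],
    "GroupB_measure1": [45, 12, 38],
    "GroupB_measure2": [50, 8, 40],
})

wide = df.set_index("userid")

# Splitting on "_" with expand=True turns the flat column names
# into a two-level MultiIndex: (group, measure).
wide.columns = wide.columns.str.split("_", expand=True)
print(wide.columns.tolist())

# Stacking the second level moves "measure1"/"measure2" into the row index,
# leaving one column per group.
stacked = wide.stack(level=1)
print(stacked)
```

Because stack works row by row, each userid's measures stay next to each other in the result.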

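If necessary, finish by sorting with DataFrame.sort_values. (wide_to_long returns the rows ordered by measure first, as printed above; stack already groups them by userid.)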

# sort on both keys so the measure order within each userid is guaranteed
df1 = df1.sort_values(['userid', 'measure'], ignore_index=True)
print(df1)
  userid  measure GroupA GroupB
0    001        1     65     45
1    001        2     70     50
2    002        1     96     12
3    002        2     89      8
4    003        1     12     38
5    003        2     14     40

Answer 2

One option is the pivot_longer function from pyjanitor, using the .value placeholder:

# pip install pyjanitor
import pandas as pd
import janitor

# the first group becomes the new column headers (.value),
# the second group becomes the values of the measure column
df.pivot_longer(index="userid", 
                names_to=(".value", "measure"), 
                names_pattern=r"(.+)_measure(\d+)"
               )

   userid measure  GroupA  GroupB
0     001       1      65      45
1     002       1      96      12
2     003       1      12      38
3     001       2      70      50
4     002       2      89       8
5     003       2      14      40

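names_pattern is a regular expression used to split the column names. .value keeps the part of the column name captured by the first group as a header, while the number captured by the second group goes into the measure column.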
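
To see which part of each column name the capture groups pick up, the regex can be tried on its own with Python's re module. This is only a sketch of the pattern matching, assuming the pattern is applied to each column name like an ordinary regex search; it does not reproduce pyjanitor's internals, and the names `loose` and `strict` are illustrative.

```python
import re

columns = ["GroupA_measure1", "GroupA_measure2",
           "GroupB_measure1", "GroupB_measure2"]

# Greedy (.+) followed by an optional underscore: the first group
# swallows everything up to the final digit, including "_measure".
loose = re.compile(r"(.+)_*(\d)")

# Spelling out "_measure" keeps it out of the first group.
strict = re.compile(r"(.+)_measure(\d+)")

loose_parts = [loose.search(c).groups() for c in columns]
strict_parts = [strict.search(c).groups() for c in columns]

print(loose_parts[0])   # ('GroupA_measure', '1')
print(strict_parts[0])  # ('GroupA', '1')
```

Whatever the first group captures becomes the header of the reshaped column, so the choice of pattern decides whether the result has GroupA_measure/GroupB_measure or GroupA/GroupB columns.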


python

pivot-table

pandas

pandas-melt