如何根据相似列将宽数据框转换为长数据框

How to convert wide dataframe to long based on similar column

我有一个这样的 pandas 数据框

我想将其转换为以下数据框

我不确定如何在这里使用 pd.wide_to_long 函数

下面是创建dataframe的数据集:

Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34

Date 列转换为 index 并为所有其他列删除可能的尾随空格 str.strip, then replace spaces to : and last split by one or more : to MultiIndex, so possible reshape by DataFrame.stack with DataFrame.rename_axis for new columns names created by DataFrame.reset_index:

df1 = df.set_index('Date')
df1.columns = df1.columns.str.strip().str.replace('\s+', ':').str.split('[:]+', expand=True)

df1 = df1.stack([0,1]).rename_axis(['Date','Symbol','Gender']).reset_index()
print (df1)
       Date Symbol  Gender  Atronaut  engineer  teacher
0  20220405     GB    Male        34        23       12
1  20220405     GB  female        34        22       11
2  20220405     IN    Male         5        29       25
3  20220405     IN  female        23        23       41
4  20220404     GB    Male        32        23       12
5  20220404     GB  female        34        23       10
6  20220404     IN    Male         4        29       21
7  20220404     IN  female        22        23       40

pivot_longer from pyjanitor 提供了一种抽象重塑的简单方法;这种情况下可以用正则表达式解决:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index = 'Date', 
    names_to = ('symbol', 'gender', '.value'), 
    names_pattern = r"(.+):\s*(.+)\s+(.+)",
    sort_by_appearance = True)

       Date symbol  gender  teacher  engineer  Atronaut
0  20220405     IN    Male       25        29         5
1  20220405     IN  female       41        23        23
2  20220405     GB    Male       12        23        34
3  20220405     GB  female       11        22        34
4  20220404     IN    Male       21        29         4
5  20220404     IN  female       40        23        22
6  20220404     GB    Male       12        23        32
7  20220404     GB  female       10        23        34

正则表达式有捕获组,任何与 .value 配对的组保持为 header,其余成为列值。