Reshaping a dataset in Python

I have this dataset:

Account lookup FY11USD FY12USD FY11local FY12local
Sales CA 1000 5000 800 4800
Sales JP 5000 6500 10 15

I am trying to get the data into this format (the example below has 2 years of data, but the number of years may vary):

Account lookup Year USD Local
Sales CA FY11 1000 800
Sales CA FY12 5000 4800
Sales JP FY11 5000 10
Sales JP FY12 6500 15

I tried the script below, but it does not separate USD and local currency within the same year. How should I do this?

df.melt(id_vars=["Account", "lookup"], 
    var_name="Year", 
    value_name="Value")

You can piece it together like this:

import pandas as pd

ids = ["Account", "lookup"]
# Melt the USD and local columns separately, then place the results side by side
usd = df[ids + ["FY11USD", "FY12USD"]].melt(id_vars=ids, var_name="Year", value_name="USD")
local = df[ids + ["FY11local", "FY12local"]].melt(id_vars=ids, var_name="Year", value_name="Local")
dfn = pd.concat([usd, local[["Local"]]], axis=1)
dfn["Year"] = dfn["Year"].str[:4]  # keep only the year prefix, e.g. "FY11USD" -> "FY11"

Output:

  Account lookup  Year   USD  Local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15
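
Since the number of years may vary, listing the year columns by hand can become brittle. Below is a minimal pandas-only sketch that avoids it, assuming every value column is named as a year prefix (FY followed by digits) and then a currency label. The names ids, long, parts, and result are illustrative and not part of the original answer, and the sample frame is rebuilt only so the snippet runs on its own:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Account": ["Sales", "Sales"],
    "lookup": ["CA", "JP"],
    "FY11USD": [1000, 5000],
    "FY12USD": [5000, 6500],
    "FY11local": [800, 10],
    "FY12local": [4800, 15],
})

ids = ["Account", "lookup"]

# Step 1: melt every non-id column into (col, value) pairs
long = df.melt(id_vars=ids, var_name="col", value_name="value")

# Step 2: split names such as "FY11USD" into Year="FY11" and currency="USD"
# (assumes the "FY<digits><label>" naming convention holds for all columns)
parts = long["col"].str.extract(r"^(?P<Year>FY\d+)(?P<currency>.+)$")
long = pd.concat([long, parts], axis=1)

# Step 3: spread the currency labels back out into their own columns
result = (
    long.pivot(index=ids + ["Year"], columns="currency", values="value")
    .rename_axis(columns=None)
    .reset_index()
)
print(result)
```

Note that the local-currency column keeps the label found in the original header (local), and the rows come back sorted by Account, lookup, and Year rather than in melt order.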

An efficient option is to convert to long form with pivot_longer from pyjanitor, using the .value placeholder. The .value entry determines which parts of the column names remain as headers:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
     index = ['Account', 'lookup'], 
     names_to = ('Year', '.value'), 
     names_pattern = r"(FY\d+)(.+)")

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     JP  FY11  5000     10
2   Sales     CA  FY12  5000   4800
3   Sales     JP  FY12  6500     15
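
To see which part of each column name lines up with 'Year' and which with '.value', the same pattern can be tried on the headers with Python's standard re module. This is only an illustration of the two capture groups, not a description of how pyjanitor applies the pattern internally; the names columns and parts are made up for the sketch:

```python
import re

# The same pattern passed to names_pattern above
pattern = re.compile(r"(FY\d+)(.+)")

columns = ["FY11USD", "FY12USD", "FY11local", "FY12local"]

# Group 1 pairs with 'Year'; group 2 pairs with '.value' and so stays a header
parts = {col: pattern.match(col).groups() for col in columns}

for col, (year, header) in parts.items():
    print(f"{col}: Year={year}, header={header}")
```

The first group supplies the values of the new Year column, while the second group (USD or local) becomes the names of the value columns.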

Another option is to use stack:

temp = df.set_index(['Account', 'lookup'])
temp.columns = temp.columns.str.split(r'(FY\d+)', expand = True).droplevel(0)
temp.columns.names = ['Year', None]
temp.stack('Year').reset_index()

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15
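
The key step in the stack approach is the split on a capturing group, which keeps the matched year as its own piece. A small sketch of the intermediate column index, using only the value-column names from the question (cols, split, and trimmed are illustrative names):

```python
import pandas as pd

cols = pd.Index(["FY11USD", "FY12USD", "FY11local", "FY12local"])

# Splitting on a capturing group keeps the delimiter, so each name yields
# three pieces: the empty text before the year, the year, and the remainder
split = cols.str.split(r"(FY\d+)", expand=True)

# Level 0 holds only the empty strings, so it is dropped
trimmed = split.droplevel(0)

print(split.tolist())
print(trimmed.tolist())
```

After the empty level is dropped, the columns form a two-level MultiIndex of (year, currency), and that is what lets stack move the year level into the rows.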

You can also achieve this with pd.wide_to_long after reshaping the columns:

index = ['Account', 'lookup']
temp = df.set_index(index)
temp.columns = (temp
                .columns
                .str.split(r'(FY\d+)')
                .str[::-1]
                .str.join('')
               )
(pd.wide_to_long(
      temp.reset_index(), 
      stubnames = ['USD', 'local'], 
      i = index, 
      j = 'Year', 
      suffix = '.+')
.reset_index()
)

  Account lookup  Year   USD  local
0   Sales     CA  FY11  1000    800
1   Sales     CA  FY12  5000   4800
2   Sales     JP  FY11  5000     10
3   Sales     JP  FY12  6500     15