Operating on pandas dataframes that may or may not be MultiIndex

Question

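I have some functions that create new columns in a pandas dataframe as a function of existing columns in that dataframe. Two different scenarios occur: (1) the dataframe is not MultiIndex and has a set of columns, say [a, b], and (2) the dataframe is MultiIndex and has the same set of column headers repeated N times, say [(a,1), (b,1), (a,2), (b,2), ..., (a,N), (b,N)].

I have been writing these functions in the style shown below: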

def f(df):
    if isinstance(df.columns, pd.MultiIndex):
        for s in df['a'].columns:
            df['c', s] = someFunction(df['a', s], df['b', s])
    else:
        df['c'] = someFunction(df['a'], df['b'])

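Is there another way to do this without having these if-multi-index/else statements everywhere and duplicating the someFunction code? I would prefer not to split the MultiIndex frame into N smaller dataframes (I often need to filter data or do other things while keeping the rows consistent across all 1, 2, ..., N frames, and keeping them together in one frame seems to be the best way to do that).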

Answer 1

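You may still need to test whether columns is a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics computed over a column, for example if someFunction divides by the mean of column 'a'. After stacking, 'a' holds the values of all N groups in a single column, so such a statistic would be computed across groups rather than per group.

Solution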
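To make the caveat about summary statistics concrete, here is a minimal sketch on a tiny hand-made frame (the values and the sublevel names 'one' and 'two' are assumptions chosen for illustration). It compares the per-group mean of 'a' with the mean of the stacked 'a' column, and shows one possible workaround, a groupby on the stacked row level, which is not part of the original answer.

```python
import pandas as pd

# Tiny frame with MultiIndex columns; values are made up for illustration.
columns = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two']])
df = pd.DataFrame([[1.0, 10.0, 0.0, 0.0],
                   [3.0, 30.0, 0.0, 0.0]], columns=columns)

# Per-group means: each ('a', s) column is averaged on its own.
per_group_mean = df['a'].mean()      # one -> 2.0, two -> 20.0

# After stacking, 'a' is one column that pools every group,
# so its mean is a single number over all four values.
stacked = df.stack()
pooled_mean = stacked['a'].mean()    # (1 + 10 + 3 + 30) / 4 = 11.0

# Possible workaround (an assumption, not from the original answer):
# group the stacked rows by the inner index level to recover per-group means.
per_group_on_stacked = stacked.groupby(level=1)['a'].transform('mean')
```

If someFunction needs a per-group statistic, something like the groupby line would have to replace the plain column mean inside it.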

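import pandas as pd

def someFunction(a, b):
    return a + b

def f(df):
    # Work on a copy so the caller's frame is left untouched.
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        # Move the inner column level into the row index so 'a' and 'b' are plain columns.
        df = df.stack()

    # Pass both input columns; the sample output further down came from a run
    # that passed df['a'] twice, which is why c equals 2 * a there.
    df['c'] = someFunction(df['a'], df['b'])

    if ismi:
        # Move the inner level back into the columns.
        df = df.unstack()

    return df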

Setup

import pandas as pd
import numpy as np

setup_tuples = []

for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))

columns = pd.MultiIndex.from_tuples(setup_tuples)

rand_array = np.random.rand(10, len(setup_tuples))

df = pd.DataFrame(rand_array, columns=columns)

df looks like this

          a                             b                    
        one       two     three       one       two     three
0  0.282834  0.490313  0.201300  0.140157  0.467710  0.352555
1  0.838527  0.707131  0.763369  0.265170  0.452397  0.968125
2  0.822786  0.785226  0.434637  0.146397  0.056220  0.003197
3  0.314795  0.414096  0.230474  0.595133  0.060608  0.900934
4  0.334733  0.118689  0.054299  0.237786  0.658538  0.057256
5  0.993753  0.552942  0.665615  0.336948  0.788817  0.320329
6  0.310809  0.199921  0.158675  0.059406  0.801491  0.134779
7  0.971043  0.183953  0.723950  0.909778  0.103679  0.695661
8  0.755384  0.728327  0.029720  0.408389  0.808295  0.677195
9  0.276158  0.978232  0.623972  0.897015  0.253178  0.093772

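I built df to have MultiIndex columns. What I am going to do is use the .stack() method to push the second level of the column index down so that it becomes the second level of the row index.

df.stack() looks like this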

                a         b
0 one    0.282834  0.140157
  three  0.201300  0.352555
  two    0.490313  0.467710
1 one    0.838527  0.265170
  three  0.763369  0.968125
  two    0.707131  0.452397
2 one    0.822786  0.146397
  three  0.434637  0.003197
  two    0.785226  0.056220
3 one    0.314795  0.595133
  three  0.230474  0.900934
  two    0.414096  0.060608
4 one    0.334733  0.237786
  three  0.054299  0.057256
  two    0.118689  0.658538
5 one    0.993753  0.336948
  three  0.665615  0.320329
  two    0.552942  0.788817
6 one    0.310809  0.059406
  three  0.158675  0.134779
  two    0.199921  0.801491
7 one    0.971043  0.909778
  three  0.723950  0.695661
  two    0.183953  0.103679
8 one    0.755384  0.408389
  three  0.029720  0.677195
  two    0.728327  0.808295
9 one    0.276158  0.897015
  three  0.623972  0.093772
  two    0.978232  0.253178

Now you can operate on df.stack() as if the columns were not a MultiIndex.

Demonstration

print(f(df))
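The stack, operate, unstack round trip can be sketched on a small deterministic frame. The values and the sublevel names below are assumptions for illustration, not the random data from the setup above.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two']])
df = pd.DataFrame([[1, 2, 10, 20],
                   [3, 4, 30, 40]], columns=columns)

# Stacking moves the inner column level into the row index:
# rows become (row, sublevel) pairs and the columns are just 'a' and 'b'.
stacked = df.stack()

# Ordinary single-level column arithmetic now covers every sublevel at once.
stacked['c'] = stacked['a'] + stacked['b']

# Unstacking moves the sublevel back into the columns.
wide = stacked.unstack()
c_one = wide[('c', 'one')].tolist()
c_two = wide[('c', 'two')].tolist()
```

Selecting by label, as in wide[('c', 'one')], avoids relying on the order in which the sublevels come back after unstacking.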

This gives you what you want:

          a                             b                             c  \
        one     three       two       one     three       two       one   
0  0.282834  0.201300  0.490313  0.140157  0.352555  0.467710  0.565667   
1  0.838527  0.763369  0.707131  0.265170  0.968125  0.452397  1.677055   
2  0.822786  0.434637  0.785226  0.146397  0.003197  0.056220  1.645572   
3  0.314795  0.230474  0.414096  0.595133  0.900934  0.060608  0.629591   
4  0.334733  0.054299  0.118689  0.237786  0.057256  0.658538  0.669465   
5  0.993753  0.665615  0.552942  0.336948  0.320329  0.788817  1.987507   
6  0.310809  0.158675  0.199921  0.059406  0.134779  0.801491  0.621618   
7  0.971043  0.723950  0.183953  0.909778  0.695661  0.103679  1.942086   
8  0.755384  0.029720  0.728327  0.408389  0.677195  0.808295  1.510767   
9  0.276158  0.623972  0.978232  0.897015  0.093772  0.253178  0.552317   


      three       two  
0  0.402600  0.980626  
1  1.526739  1.414262  
2  0.869273  1.570453  
3  0.460948  0.828193  
4  0.108599  0.237377  
5  1.331230  1.105884  
6  0.317349  0.399843  
7  1.447900  0.367907  
8  0.059439  1.456654  
9  1.247944  1.956464
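As a quick check that a single function can serve both kinds of frame, the sketch below runs a version of f that passes both 'a' and 'b' on a flat frame and on a MultiIndex frame. The two small input frames are assumptions made up for this example.

```python
import pandas as pd

def someFunction(a, b):
    return a + b

def f(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
    df['c'] = someFunction(df['a'], df['b'])
    if ismi:
        df = df.unstack()
    return df

# Case 1: plain columns [a, b].
flat = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
flat_out = f(flat)

# Case 2: the same headers repeated under two sublevels.
multi = pd.DataFrame(
    [[1, 2, 10, 20], [3, 4, 30, 40]],
    columns=pd.MultiIndex.from_product([['a', 'b'], ['one', 'two']]),
)
multi_out = f(multi)
```

Note that in the output shown above the sublevels appear as one, three, two rather than in their original order, so it is safer to select results by label than by position.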
