Sum over multiindex pandas columns

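I want to create a DataFrame in which both the columns (year, quarter, month) and the index (some attributes) are hierarchical, i.e. a MultiIndex. I want to sum over certain levels, for example over all months that belong to one quarter. In pandas this can be done with a line such as: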

# axis 1 = columns, level 0 = year, level 1 = quarter
df.sum(axis=1, level=[0, 1])

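This works until, in some odd cases, the index is no longer recognized correctly, which triggers the error message No axis named 1 for object type <class 'pandas.core.series.Series'>

In the code below I create two identical DataFrames (MultiIndex on both axes) with a single difference: df1 is created empty and filled afterwards, while df2 is created already filled. The sum works for df2 but not for df1. I don't understand what is happening behind the scenes. Can anyone point me to an explanation of this difference?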

import pandas as pd
import numpy as np

cols = [(y, divmod(m - 1, 3)[0] + 1, m)
        for y in list(range(2011, 2014)) for m in list(range(1, 13))]

inds = [(a, b, c)
        for a in ["a1", "a2"] for b in ["b1", "b2"] for c in ["c1", "c2"]]

df1 = pd.DataFrame(index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

df2 = pd.DataFrame(np.ones(df1.shape),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year", "quarter", "month"]))

for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    df2.loc[ind, col] = entry

try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception as exc:  # report the reason instead of swallowing it
    print("Sum over df1 did not work...", exc)

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except Exception as exc:
    print("Sum over df2 did not work...", exc)

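PS: I found a hint: the entries in df1 are of type float, while those in df2 are np.float64, but that still didn't help...

The problem is that all values in df1 are stored as objects. That usually means strings, but here they are <class 'float'>: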

print (df1.dtypes)
year  quarter  month
2011  1        1        object
               2        object
               3        object
      2        4        object
               5        object
               6        object
      3        7        object
               8        object
               9        object
      4        10       object

print (df2.dtypes)
year  quarter  month
2011  1        1        float64
               2        float64
               3        float64
      2        4        float64
               5        float64
               6        float64
      3        7        float64
               8        float64
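
To make the dtype difference easier to see, here is a minimal sketch on a much smaller frame. The names empty, filled, r1/r2 and c1/c2 are made up for illustration and do not come from the question. It assumes that a DataFrame built from only an index and columns starts out with object columns, and that assigning cell by cell with .loc does not change that:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: created without data, so every column is object.
empty = pd.DataFrame(index=["r1", "r2"], columns=["c1", "c2"])

# Hypothetical stand-in for df2: created from a numpy array, so every column is float64.
filled = pd.DataFrame(np.ones((2, 2)), index=["r1", "r2"], columns=["c1", "c2"])

# Assigning single cells stores the values but leaves each column's dtype as it was.
for row in empty.index:
    for col in empty.columns:
        value = np.random.rand()
        empty.loc[row, col] = value
        filled.loc[row, col] = value

print(empty.dtypes.tolist())   # object columns, even though every cell is a float
print(filled.dtypes.tolist())  # float64 columns
```

Both frames hold the same numbers. Only the column dtype differs, and that is what the reduction trips over.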

So casting works:

try:
    # cast the object columns to float before summing
    df1.astype(float).sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception as exc:
    print("Sum over df1 did not work...", exc)

try:
    df2.sum(axis=1, level=[0, 1])
    print("Sum over df2 did work")
except Exception as exc:
    print("Sum over df2 did not work...", exc)

Sum over df1 did work
Sum over df2 did work
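
One caveat for newer installations: the level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in 2.0, so the calls above only run on older versions. A possible equivalent, sketched here on a small hypothetical frame (frame, numeric and quarterly are illustrative names, not from the question), is to cast first and then group the transposed frame by the column levels:

```python
import numpy as np
import pandas as pd

# Hypothetical reduced version of the question's columns: 2 years x 2 quarters x 3 months.
cols = pd.MultiIndex.from_tuples(
    [(y, (m - 1) // 3 + 1, m) for y in (2011, 2012) for m in range(1, 7)],
    names=["year", "quarter", "month"],
)

# Object-dtype frame, mimicking df1 after it was filled cell by cell.
frame = pd.DataFrame(np.ones((2, len(cols))), index=["r1", "r2"], columns=cols).astype(object)

# Cast to float first, as in the answer above.
numeric = frame.astype(float)

# Transposing turns the column levels into row levels, which groupby can aggregate;
# transposing back restores the original orientation.
quarterly = numeric.T.groupby(level=["year", "quarter"]).sum().T
print(quarterly)
```

Each cell of quarterly is the sum of the three months in that quarter, so with all-ones input every entry is 3.0.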

for (col, ind) in [(col, ind) for ind in df1.index.values for col in df1.columns.values]:
    entry = np.random.rand()
    df1.loc[ind, col] = entry
    print (type(df1.loc[ind, col]))
    df2.loc[ind, col] = entry
    print (type(df2.loc[ind, col]))

<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>
<class 'float'>
<class 'numpy.float64'>

The best approach is to create the DataFrame from a numpy array; then everything works:

df1 = pd.DataFrame(data = np.random.rand(len(inds), len(cols)),
                   index=pd.MultiIndex.from_tuples(inds, names=["a", "b", "c"]),
                   columns=pd.MultiIndex.from_tuples(cols, names=["year","quarter","month"]))


try:
    df1.sum(axis=1, level=[0, 1])
    print("Sum over df1 did work")
except Exception as exc:
    print("Sum over df1 did not work...", exc)

Sum over df1 did work
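
If the frame really has to be created empty and filled afterwards, another option (not part of the original answer, so treat it as a suggestion) is to pass dtype=float at construction. A minimal sketch with made-up labels:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame: dtype=float gives NaN-filled float64 columns instead of object.
typed = pd.DataFrame(index=["r1", "r2"], columns=["c1", "c2"], dtype=float)

# Filling a single cell keeps the float64 dtype.
typed.loc["r1", "c1"] = np.random.rand()

print(typed.dtypes.tolist())
```

Unfilled cells stay NaN here, and sum skips NaN by default.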