Identically constructed MultiIndex DataFrames: one won't aggregate (mean)

Short question:

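I am trying to get the mean of a column (a Series) after grouping a multi-indexed pandas DataFrame that I build in two different ways. The only difference is how the DataFrame is constructed. One version gives me the result I want; the other raises DataError: No numeric types to aggregate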

Description:

Constructing the common data

import pandas as pd
import numpy as np
indexTuples = [('a', 1), ('b', 3), ('a', 2), ('c', 2), ('c', 3), ('b', 8)]
multiIndex = pd.MultiIndex.from_tuples(indexTuples, names = ['x', 'y'])

Constructing the DataFrame via method 1

columns = ['alpha', 'beta', 'gamma']
df = pd.DataFrame(index=multiIndex, columns=columns)

alpha = pd.Series(index=multiIndex)
beta = pd.Series(index=multiIndex)
gamma = pd.Series(index=multiIndex)

for tup in indexTuples:
    alpha[tup[0], tup[1]] = np.random.randint(400)
    beta[tup[0], tup[1]] = np.random.randint(400)
    gamma[tup[0], tup[1]] = np.random.randint(400)

df.alpha = alpha
df.beta = beta
df.gamma = gamma

df.alpha['a'] = np.nan

df

gives a DataFrame that looks like this

     alpha   beta  gamma
x y                     
a 1    NaN  136.0  224.0
b 3  375.0  227.0  191.0
a 2    NaN  367.0  195.0
c 2  247.0   61.0   78.0
  3  238.0  187.0  366.0
b 8  302.0   14.0  272.0    

If I do the following, I get the expected result

df.groupby(level='x').alpha.mean()

Result

x
a      NaN
b    148.0
c    244.5
Name: alpha, dtype: float64

Constructing the DataFrame via method 2

columns = ['alpha', 'beta', 'gamma']
_df = pd.DataFrame(index=multiIndex, columns=columns)

for tup in indexTuples:
    _df.alpha[tup[0], tup[1]] = np.random.randint(400)
    _df.beta[tup[0], tup[1]] = np.random.randint(400)
    _df.gamma[tup[0], tup[1]] = np.random.randint(400)

_df.alpha['a'] = np.nan

This gives a DataFrame that looks just like the one from the previous method, NaN values included.

But now, when I try to take the mean after grouping by level

_df.groupby(level='x').alpha.mean() 

I get the following error

---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-192-ad2de6450fab> in <module>()
----> 1 _df.groupby(level='x').alpha.mean()

/film/tools/packages/pandas/0.18.0/CentOS-6.2_thru_7/python-2.7/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in mean(self)
    933         """
    934         try:
--> 935             return self._cython_agg_general('mean')
    936         except GroupByError:
    937             raise

/film/tools/packages/pandas/0.18.0/CentOS-6.2_thru_7/python-2.7/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
    750 
    751         if len(output) == 0:
--> 752             raise DataError('No numeric types to aggregate')
    753 
    754         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

Why does the first case work while the second does not?

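When you construct _df, its columns end up with dtype object. This happens because you define _df without any data, so every column defaults to object. In construction #1 you get around this by assigning Series that were built independently and have a float dtype, which replaces each object column with a float one. In construction #2 you assign values to individual locations of _df, and those locations already belong to object columns, so the dtype stays object. groupby(...).mean() only aggregates numeric columns, hence the error.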

_df.dtypes

alpha    object
beta     object
gamma    object
dtype: object
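
To see the difference in isolation, here is a minimal sketch. The names frame and idx and the sample values are made up for illustration and are not part of the question. It assumes a reasonably recent pandas: cell-by-cell writes into a data-less DataFrame leave the column as object, while assigning a whole float Series replaces the column with a float64 one.

```python
import pandas as pd

# Small illustrative MultiIndex (hypothetical data, not the question's).
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 3)], names=['x', 'y'])

# No data supplied, so every column starts out with dtype object.
frame = pd.DataFrame(index=idx, columns=['alpha', 'beta'])

# Writing a single cell stores the value inside the existing object column.
frame.loc[('a', 1), 'beta'] = 5

# Assigning a whole float Series replaces the column with a float64 one.
frame['alpha'] = pd.Series([1.0, 2.0, 3.0], index=idx)

print(frame.dtypes)
```

Printing frame.dtypes should show alpha as float64 and beta still as object, which mirrors constructions #1 and #2 respectively.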

Use this to get your result:

_df.astype(float).groupby(level='x').alpha.mean()
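
An alternative to converting afterwards is to avoid the object dtype in the first place. The sketch below is one way this might look, not the only option: it passes dtype=float to the constructor and fills cells with .loc. The name df_float is hypothetical, and the alpha values are copied from the table printed in the question, so the group means can be checked by hand.

```python
import pandas as pd

index_tuples = [('a', 1), ('b', 3), ('a', 2), ('c', 2), ('c', 3), ('b', 8)]
multi_index = pd.MultiIndex.from_tuples(index_tuples, names=['x', 'y'])

# dtype=float creates NaN-filled float64 columns instead of object ones.
df_float = pd.DataFrame(index=multi_index,
                        columns=['alpha', 'beta', 'gamma'],
                        dtype=float)

# alpha values taken from the question's table; the 'a' rows are left as NaN.
alpha_values = {('b', 3): 375, ('c', 2): 247, ('c', 3): 238, ('b', 8): 302}
for key, value in alpha_values.items():
    # .loc[row key, column] writes in place and keeps the float64 dtype.
    df_float.loc[key, 'alpha'] = value

result = df_float.groupby(level='x')['alpha'].mean()
print(df_float.dtypes)
print(result)
```

Because the columns are numeric from the start, the groupby mean runs without any astype call. Group a has only NaN values, so its mean is NaN.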