pandas style can't set background gradient on MultiIndex after DataFrame division

I can apply a pandas Styler heatmap to a MultiIndex DataFrame without any problem:

import pandas as pd
import seaborn as sns

df = sns.load_dataset('geyser').reset_index()

df['3m_duration'] = df.duration > 3

group_cols = ['kind', '3m_duration']

count_gpby = df[
    group_cols + ['index']
].groupby(
    group_cols
)

# Total count per group; reused below as the denominator `gpby`
gpby = count_gpby.count()

gpby.style.background_gradient(cmap='Blues')

I can also divide a subset groupby by the total groupby to get a comparative rate/ratio for each group:

df['binary'] = 'A'
df.loc[100:, 'binary'] = 'B'

subset_gpby = df[
    group_cols + ['index']
].loc[df.binary=='B'].groupby(
    group_cols
).count()

(subset_gpby / gpby).style.background_gradient(cmap ='Blues')

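But then I tried to combine these two "views" as two columns of the same MultiIndex DataFrame, so that I could see the raw counts and the comparative ratio side by side. Printing works fine:

But it fails to render with the pandas Styler heatmap background gradient because of a "non-unique index":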

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-275-f82cbb6545e2> in <module>
----> 1 pd.concat([(subset_gpby / gpby), gpby], axis=1).style.background_gradient(cmap ='Blues')

C:\ProgramData\Anaconda3\envs\venv\lib\site-packages\pandas\core\frame.py in style(self)
    959         from pandas.io.formats.style import Styler
    960 
--> 961         return Styler(self)
    962 
    963     _shared_docs[

C:\ProgramData\Anaconda3\envs\venv\lib\site-packages\pandas\io\formats\style.py in __init__(self, data, precision, table_styles, uuid, caption, table_attributes, cell_ids, na_rep, uuid_len)
    161             data = data.to_frame()
    162         if not data.index.is_unique or not data.columns.is_unique:
--> 163             raise ValueError("style is not supported for non-unique indices.")
    164 
    165         self.data = data

ValueError: style is not supported for non-unique indices.

However,

pd.concat([(subset_gpby / gpby), gpby], axis=1).index.value_counts()

> (short, False)    1
> (short, True)     1
> (long, True)      1
> (long, False)     1
> dtype: int64

shows that each index value occurs only once, and that the index is equal to the one that rendered without problems earlier:

pd.concat([(subset_gpby / gpby), gpby], axis=1).index == (subset_gpby / gpby).index

> array([ True,  True,  True,  True])

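Why does this error occur?

In pandas, both the "index" and the "columns" are of type pd.Index. For this reason, both axes can be referred to as an Index. The Styler object only works on uniquely indexed DataFrames (see other limitations), and this includes both dimensions.

When concat joins these two frames, we end up with multiple columns named 'index':
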
pd.concat([(subset_gpby / gpby), gpby], axis=1)

                      index  index  # <- Note the duplicate column names
kind  3m_duration                 
long  False        1.000000      1
      True         0.631579    171
short False        0.635417     96
      True         0.500000      4
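
A quick way to confirm which axis trips the check is to ask each axis whether it is unique, which is the same condition shown in the traceback above. This is a minimal sketch that uses small hand-written stand-ins (the names `ratio` and `count` and their values are hypothetical) rather than the geyser data:

```python
import pandas as pd

# Hypothetical stand-ins for (subset_gpby / gpby) and gpby.
idx = pd.MultiIndex.from_product(
    [['long', 'short'], [False, True]], names=['kind', '3m_duration']
)
ratio = pd.DataFrame({'index': [1.0, 0.6, 0.6, 0.5]}, index=idx)
count = pd.DataFrame({'index': [1, 171, 96, 4]}, index=idx)

combined = pd.concat([ratio, count], axis=1)

# Both axes are Index objects, so both expose is_unique.
print(isinstance(combined.index, pd.Index))    # row axis (a MultiIndex)
print(isinstance(combined.columns, pd.Index))  # column axis
print(combined.index.is_unique)    # the row labels are unique
print(combined.columns.is_unique)  # the column labels are not
```

This is why checking only `.index.value_counts()` in the question did not reveal the problem: the duplicates are in the columns.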

Since we don't have meaningful column names, we can simply pass ignore_index=True to concat (note that this only affects the concatenation axis, in this case axis=1):

pd.concat([(subset_gpby / gpby), gpby], axis=1, ignore_index=True)

                          0    1
kind  3m_duration               
long  False        1.000000    1
      True         0.631579  171
short False        0.635417   96
      True         0.500000    4
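
To illustrate that only the concatenation axis is relabeled, the sketch below repeats the concat on small hypothetical stand-in frames (`ratio` and `count` are made-up names and values) and compares both axes afterwards:

```python
import pandas as pd

# Hypothetical stand-ins for (subset_gpby / gpby) and gpby.
idx = pd.MultiIndex.from_product(
    [['long', 'short'], [False, True]], names=['kind', '3m_duration']
)
ratio = pd.DataFrame({'index': [1.0, 0.6, 0.6, 0.5]}, index=idx)
count = pd.DataFrame({'index': [1, 171, 96, 4]}, index=idx)

renumbered = pd.concat([ratio, count], axis=1, ignore_index=True)

# The column labels are replaced by consecutive integers...
print(list(renumbered.columns))
# ...while the row MultiIndex is left as it was.
print(renumbered.index.equals(idx))
print(renumbered.columns.is_unique)
```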

Alternatively, as always, we can rename the columns to something meaningful. However, we need something like set_axis here, since rename would affect all columns named "index":

pd.concat(
    [(subset_gpby / gpby), gpby], axis=1
).set_axis(['Subset gpby', 'gpby'], axis=1)

                   Subset gpby  gpby
kind  3m_duration                   
long  False           1.000000     1
      True            0.631579   171
short False           0.635417    96
      True            0.500000     4
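
The difference between the two approaches can be seen on a toy frame with duplicate column labels. This is only a sketch with hypothetical stand-in data; `ratio`, `count`, and the new label `'count'` are illustrative names:

```python
import pandas as pd

# Hypothetical stand-ins for (subset_gpby / gpby) and gpby.
idx = pd.MultiIndex.from_product(
    [['long', 'short'], [False, True]], names=['kind', '3m_duration']
)
ratio = pd.DataFrame({'index': [1.0, 0.6, 0.6, 0.5]}, index=idx)
count = pd.DataFrame({'index': [1, 171, 96, 4]}, index=idx)
combined = pd.concat([ratio, count], axis=1)

# rename maps labels, so every column called 'index' gets the same new name.
renamed = combined.rename(columns={'index': 'count'})
print(list(renamed.columns))
print(renamed.columns.is_unique)

# set_axis assigns labels by position, so each column gets its own name.
relabeled = combined.set_axis(['Subset gpby', 'gpby'], axis=1)
print(list(relabeled.columns))
print(relabeled.columns.is_unique)
```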

Either way, we can use background_gradient again, since the column index is now unique:

pd.concat(
    [(subset_gpby / gpby), gpby], axis=1, ignore_index=True
).style.background_gradient(cmap='Blues')


Setup used:

import pandas as pd
import seaborn as sns

# Setup Data
df = sns.load_dataset('geyser').reset_index()
group_cols = ['kind', '3m_duration']
df['3m_duration'] = df['duration'].gt(3)
subset_df = df[[*group_cols, 'index']].copy()

# Build Count DataFrames
gpby = subset_df.groupby(group_cols).count()
# Rows from label 100 onward correspond to binary == 'B' in the question
subset_gpby = subset_df.loc[100:, :].groupby(group_cols).count()