How to structure a pandas dataframe for plotting nested pie/donut charts?

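This is similar, but it is outdated and the code does not work with the current version of pandas:

Here is a typical example of what I am trying to achieve, though it is not necessarily accurate:

I am trying to create a chart that looks like this one, but with labels. I know that labeling every wedge at every level would be absurd, so I am looking for a way to say that anything below a certain count gets grouped as "Other": https://matplotlib.org/3.5.1/gallery/pie_and_polar_charts/nested_pie.html

I have the following table: https://pastebin.com/raw/vC5C355D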

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("https://pastebin.com/raw/vC5C355D", sep="\t", index_col=0)

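Honestly, I don't even know where to start. There are 5 hierarchy levels, in this order: [class, order, family, genus, species]

Do I iterate through each level and call .value_counts() on each column? If so, how is the hierarchy preserved? I am not sure how to structure the dataframe to plot this.

Can someone help with 1) how to structure the dataframe so it can be used for hierarchical pie/donut charts; and 2) how to adapt the documentation to said dataframe?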

how to structure the dataframe so it can be used for hierarchical pie/donut charts

This is an ideal use case for a hierarchical MultiIndex:

  1. Use df.value_counts to generate the counts in a MultiIndex (one feature per level):

    counts = df.value_counts() # long output shown at bottom of post
    
  2. The wedge values can then be computed simply with groupby.sum, e.g. for level 2:

    counts.groupby(level=[0, 1, 2]).sum() # long output shown at bottom of post
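
As a quick illustration of why the two steps above preserve the hierarchy, the sketch below uses a small made-up three-level frame (the data and the names `ring0`, `ring1`, and `totals` are assumptions for this example, not part of the question's table). Every ring sums to the same total, and each parent wedge equals the sum of its child wedges, which is what makes the rings line up.

```python
import pandas as pd

# Hypothetical three-level hierarchy, for illustration only
df = pd.DataFrame({
    "one":   ["A", "A", "A", "B", "B"],
    "two":   ["AD", "AD", "AE", "BF", "BF"],
    "three": ["ADJ", "ADK", "AEL", "BFM", "BFM"],
})

# Step 1: one MultiIndex level per column
counts = df.value_counts()

# Step 2: grouped sums for levels 0..n give the wedge sizes of ring n
totals = [
    counts.groupby(level=list(range(n + 1))).sum().sum()
    for n in range(counts.index.nlevels)
]

# Each parent wedge is the sum of its children in the next ring
ring0 = counts.groupby(level=0).sum()
ring1 = counts.groupby(level=[0, 1]).sum()
children_per_parent = ring1.groupby(level=0).sum()

print(totals)
print(ring0.tolist(), children_per_parent.tolist())
```

Because every ring is derived from the same `counts` Series, there is no need to call value_counts on each column separately.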
    

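The matplotlib nested donut demo uses the same concept with numpy arrays (one feature per array dimension), but that becomes too unwieldy for higher dimensions. Structuring the counts as an n-level MultiIndex is much simpler than an n-dimensional array.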
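
To make that comparison concrete, here is a minimal sketch on made-up two-level data (the frame and variable names are assumptions for illustration). It computes the inner and outer ring sizes both ways. The array version needs a cell for every possible combination, including ones that never occur, while the MultiIndex only stores the combinations that were observed.

```python
import pandas as pd

# Hypothetical two-level data, for illustration only
df = pd.DataFrame({
    "one": ["A", "A", "A", "B", "B", "C"],
    "two": ["x", "x", "y", "x", "y", "y"],
})

# MultiIndex approach: one index level per feature
counts = df.value_counts().sort_index()
inner_mi = counts.groupby(level=0).sum()
outer_mi = counts.groupby(level=[0, 1]).sum()

# ndarray approach (as in the matplotlib demo): one axis per feature
arr = pd.crosstab(df["one"], df["two"]).to_numpy()
inner_np = arr.sum(axis=1)   # collapse the second axis for the inner ring
outer_np = arr.flatten()     # every cell is an outer wedge, zeros included

print(inner_mi.tolist(), inner_np.tolist())
print(outer_mi.tolist(), outer_np.tolist())
```

With five levels the array would need five axes and a cell for every combination of labels, most of them empty, whereas the MultiIndex version only changes the list passed to `level`.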


how to adapt the documentation to said dataframe

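Update: The code now colors the wedges according to their root node:

Full code to convert the original DataFrame -> nested donut (with a more manageable demo example):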

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

WEDGE_SIZE = 0.5
LABEL_THRESHOLD = 1

df = pd.DataFrame({'one': list('AAAAAAAAABBBBBBBCCCC'), 'two': list('DDDDDDEEEFFFGGGGHHII'), 'three': list('JJJKKLLMMMMNNNNNNNNN'), 'four': list('OOPPPPQQRSTTTUUUUVVV'), 'five': list('WWWXXXXXXYYYYYYZZZZZ')}).cumsum(1)

fig, ax = plt.subplots()

# generate MultiIndex of counts with one feature per level
counts = df.value_counts()

# define one primary colormap per root group (cycles if there are more than 6 root groups)
cmaps = np.resize(['Blues_r', 'Greens_r', 'Oranges_r', 'Purples_r', 'Reds_r', 'Greys_r'],
                  counts.index.get_level_values(0).nunique())

for level in range(len(counts.index.names)):
    # compute grouped sums up to current level
    wedges = counts.groupby(level=list(range(level+1))).sum()

    # extract annotation labels from MultiIndex
    labels = wedges.index.get_level_values(level)

    # generate color shades per group
    index = [(i,) if level == 0 else i for i in wedges.index.tolist()] # standardize Index vs MultiIndex
    g0 = pd.DataFrame.from_records(index).groupby(0)
    maps = g0.ngroup()
    shades = g0.cumcount() / g0.size().max()
    colors = [plt.get_cmap(cmaps[m])(s) for m, s in zip(maps, shades)]
    
    # plot colorized/labeled donut layer
    ax.pie(x=wedges,
           radius=1 + (level * WEDGE_SIZE),
           colors=colors,
           labels=np.where(wedges >= LABEL_THRESHOLD, labels, ''), # unlabel if under threshold
           rotatelabels=True,
           labeldistance=1.1 - 1.4/(level+3.5), # put labels inside wedge instead of outside (requires manual tweaking)
           wedgeprops=dict(width=WEDGE_SIZE, linewidth=0, alpha=0.33))

plt.show()

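Note that your sample data maps to a large number of wedges (outer layer = 199 species), so aggregating the smaller values into "Other" won't really work. The wedges are essentially all the same small size, so I am not sure how to label the full sample sensibly.

The full sample is on the left and a smaller subset is on the right: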
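
For data where some wedges really are much smaller than others, one possible way to lump them into "Other" before counting is sketched below. This is only an assumption about how it could be done, not part of the code above: the helper `lump_small` and its parameters `column`, `min_count`, and `other` are hypothetical names, and the demo frame is made up. Unlike LABEL_THRESHOLD, which only hides labels, this merges the small wedges themselves.

```python
import pandas as pd

def lump_small(df, column, min_count, other="Other"):
    """Hypothetical helper: relabel rare values of `column` (and deeper columns) as `other`."""
    cols = list(df.columns)
    pos = cols.index(column)
    out = df.copy()
    # number of rows sharing the same path from the root down to `column`
    path_size = out.groupby(cols[:pos + 1])[column].transform("size")
    small = path_size < min_count
    # overwrite the small node and all of its descendants;
    # parent columns are untouched, so each parent keeps its own "Other" wedge
    out.loc[small, cols[pos:]] = other
    return out

# Made-up two-level data, for illustration only
df = pd.DataFrame({
    "one": list("AAAAAABBBB"),
    "two": list("DDDDEFGGGH"),
})

lumped = lump_small(df, "two", min_count=2)
counts = lumped.value_counts().sort_index()
print(counts)
```

The resulting `counts` can be fed to the same groupby.sum loop, since the parent levels still distinguish the "Other" wedge under A from the one under B.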


For reference, these are the outputs of df -> df.value_counts -> groupby.sum.

Original df:

>>> df = pd.DataFrame({'one': list('AAAAAAAAABBBBBBBCCCC'), 'two': list('DDDDDDEEEFFFGGGGHHII'), 'three': list('JJJKKLLMMMMNNNNNNNNN'), 'four': list('OOPPPPQQRSTTTUUUUVVV'), 'five': list('WWWXXXXXXYYYYYYZZZZZ')}).cumsum(1)
>>> df

   one two three  four   five
0    A  AD   ADJ  ADJO  ADJOW
1    A  AD   ADJ  ADJO  ADJOW
2    A  AD   ADJ  ADJP  ADJPW
3    A  AD   ADK  ADKP  ADKPX
4    A  AD   ADK  ADKP  ADKPX
5    A  AD   ADL  ADLP  ADLPX
6    A  AE   AEL  AELQ  AELQX
7    A  AE   AEM  AEMQ  AEMQX
8    A  AE   AEM  AEMR  AEMRX
9    B  BF   BFM  BFMS  BFMSY
10   B  BF   BFM  BFMT  BFMTY
11   B  BF   BFN  BFNT  BFNTY
12   B  BG   BGN  BGNT  BGNTY
13   B  BG   BGN  BGNU  BGNUY
14   B  BG   BGN  BGNU  BGNUY
15   B  BG   BGN  BGNU  BGNUZ
16   C  CH   CHN  CHNU  CHNUZ
17   C  CH   CHN  CHNV  CHNVZ
18   C  CI   CIN  CINV  CINVZ
19   C  CI   CIN  CINV  CINVZ

MultiIndex from df.value_counts:

>>> counts = df.value_counts()
>>> counts

one  two  three  four  five 
A    AD   ADJ    ADJO  ADJOW    2
          ADK    ADKP  ADKPX    2
B    BG   BGN    BGNU  BGNUY    2
C    CI   CIN    CINV  CINVZ    2
A    AD   ADJ    ADJP  ADJPW    1
          ADL    ADLP  ADLPX    1
     AE   AEL    AELQ  AELQX    1
          AEM    AEMQ  AEMQX    1
                 AEMR  AEMRX    1
B    BF   BFM    BFMS  BFMSY    1
                 BFMT  BFMTY    1
          BFN    BFNT  BFNTY    1
     BG   BGN    BGNT  BGNTY    1
                 BGNU  BGNUZ    1
C    CH   CHN    CHNU  CHNUZ    1
                 CHNV  CHNVZ    1

Wedge totals from groupby.sum:

>>> counts.groupby(level=[0]).sum()

one
A    9
B    7
C    4
>>> counts.groupby(level=[0, 1]).sum()

one  two
A    AD     6
     AE     3
B    BF     3
     BG     4
C    CH     2
     CI     2
>>> counts.groupby(level=[0, 1, 2]).sum()

one  two  three
A    AD   ADJ      3
          ADK      2
          ADL      1
     AE   AEL      1
          AEM      2
B    BF   BFM      2
          BFN      1
     BG   BGN      4
C    CH   CHN      2
     CI   CIN      2
>>> counts.groupby(level=[0, 1, 2, 3]).sum()

one  two  three  four
A    AD   ADJ    ADJO    2
                 ADJP    1
          ADK    ADKP    2
          ADL    ADLP    1
     AE   AEL    AELQ    1
          AEM    AEMQ    1
                 AEMR    1
B    BF   BFM    BFMS    1
                 BFMT    1
          BFN    BFNT    1
     BG   BGN    BGNT    1
                 BGNU    3
C    CH   CHN    CHNU    1
                 CHNV    1
     CI   CIN    CINV    2