How to structure a pandas dataframe for plotting nested pie/donut charts?
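This is similar, but it is outdated and the code does not work with the current version of pandas:
Here is a generic example of what I'm trying to achieve, though it is not necessarily accurate:
I'm trying to create a chart that looks like this but with labels. I know that labeling every level would be ridiculous, so I'm looking for a way to say that anything below a certain count gets grouped as "Other":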
https://matplotlib.org/3.5.1/gallery/pie_and_polar_charts/nested_pie.html
I have the following table:
https://pastebin.com/raw/vC5C355D
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("https://pastebin.com/raw/vC5C355D", sep="\t", index_col=0)
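Honestly, I don't even know where to start. There are 5 hierarchy levels, in this order: [class, order, family, genus, species].
Do I iterate through each level and call .value_counts() on each column? If so, how is the hierarchy preserved? I'm not sure how to structure the dataframe to plot this.
Can someone help with 1) structuring the dataframe so it can be used for hierarchical pie/donut charts; and 2) adapting the documentation to said dataframe?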
how to structure the dataframe so it can be used for hierarchical pie/donut charts
This is an ideal case for a hierarchical MultiIndex:
Use df.value_counts to generate the counts in a MultiIndex (one feature per level):
counts = df.value_counts() # long output shown at bottom of post
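To make the point about preserving the hierarchy concrete, here is a minimal sketch on a made-up two-column table (the `toy` frame and its taxon names are illustrative assumptions, not the question's data). Counting whole rows, rather than each column separately, keeps every child attached to its parent.

```python
import pandas as pd

# Hypothetical toy table (not the question's data): two hierarchy levels.
toy = pd.DataFrame({
    "class": ["Mammalia", "Mammalia", "Mammalia", "Aves"],
    "order": ["Primates", "Primates", "Carnivora", "Passeriformes"],
})

# DataFrame.value_counts counts unique rows; the result is a Series whose
# index is a MultiIndex with one level per column.
toy_counts = toy.value_counts()

print(toy_counts.index.names)                # level names follow the column order
print(toy_counts[("Mammalia", "Primates")])  # count for one (class, order) path
```

Calling .value_counts() on each column individually would give the same per-level totals, but it would drop the information about which order belongs to which class.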
The wedge values can then be computed simply with groupby.sum, e.g., for level 2:
counts.groupby(level=[0, 1, 2]).sum() # long output shown at bottom of post
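As a rough illustration of how one ring is derived per level, the sketch below runs the same groupby.sum over progressively deeper prefixes of the MultiIndex. The `toy` frame and the `ring_sizes` dictionary are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical three-level table, smaller than the demo frame in the answer.
toy = pd.DataFrame({
    "one":   ["A", "A", "A", "B", "B"],
    "two":   ["AD", "AD", "AE", "BF", "BF"],
    "three": ["ADJ", "ADK", "AEL", "BFM", "BFM"],
})
counts = toy.value_counts()

ring_sizes = {}
for depth in range(1, counts.index.nlevels + 1):
    # Sum the leaf counts over the first `depth` levels: one wedge per group.
    wedges = counts.groupby(level=list(range(depth))).sum()
    ring_sizes[depth] = len(wedges)
    # Every ring partitions the same rows, so its wedges add up to len(toy).
    assert wedges.sum() == len(toy)

print(ring_sizes)
```

Because groupby sorts the group keys by default, the children of a given parent come out contiguous and in the same parent order at every depth, which is what lets the rings line up when they are drawn on top of each other.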
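The matplotlib nested donut demo uses the same concept with numpy arrays (one feature per matrix dimension), but that becomes too unwieldy for higher dimensions. Structuring the counts as an n-level MultiIndex is much simpler than building an n-dimensional array.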
how to adapt the documentation to said dataframe
Update: the code now colors the wedges by root node:
Full code to convert the original DataFrame -> nested donut (with a more manageable demo example):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
WEDGE_SIZE = 0.5
LABEL_THRESHOLD = 1
df = pd.DataFrame({'one': list('AAAAAAAAABBBBBBBCCCC'), 'two': list('DDDDDDEEEFFFGGGGHHII'), 'three': list('JJJKKLLMMMMNNNNNNNNN'), 'four': list('OOPPPPQQRSTTTUUUUVVV'), 'five': list('WWWXXXXXXYYYYYYZZZZZ')}).cumsum(1)
fig, ax = plt.subplots()
# generate MultiIndex of counts with one feature per level
counts = df.value_counts()
# define primary colormaps (cycle if levels > 6)
cmaps = np.resize(['Blues_r', 'Greens_r', 'Oranges_r', 'Purples_r', 'Reds_r', 'Greys_r'],
                  counts.index.get_level_values(0).size)
for level in range(len(counts.index.names)):
    # compute grouped sums up to current level
    wedges = counts.groupby(level=list(range(level+1))).sum()
    # extract annotation labels from MultiIndex
    labels = wedges.index.get_level_values(level)
    # generate color shades per group
    index = [(i,) if level == 0 else i for i in wedges.index.tolist()]  # standardize Index vs MultiIndex
    g0 = pd.DataFrame.from_records(index).groupby(0)
    maps = g0.ngroup()
    shades = g0.cumcount() / g0.size().max()
    colors = [plt.get_cmap(cmaps[m])(s) for m, s in zip(maps, shades)]
    # plot colorized/labeled donut layer
    ax.pie(x=wedges,
           radius=1 + (level * WEDGE_SIZE),
           colors=colors,
           labels=np.where(wedges >= LABEL_THRESHOLD, labels, ''),  # unlabel if under threshold
           rotatelabels=True,
           labeldistance=1.1 - 1.4/(level+3.5),  # put labels inside wedge instead of outside (requires manual tweaking)
           wedgeprops=dict(width=WEDGE_SIZE, linewidth=0, alpha=0.33))
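The shade computation is the least obvious step in the loop, so here it is in isolation. This is only a sketch: the `index` list below is a made-up set of (root, child) tuples standing in for `wedges.index.tolist()` on a second-level ring.

```python
import pandas as pd

# Hypothetical (root, child) tuples for a second-level ring.
index = [("A", "AD"), ("A", "AE"),
         ("B", "BF"), ("B", "BG"), ("B", "BH"),
         ("C", "CH")]

# Group the wedges by their root node (column 0 of the tuple records).
g0 = pd.DataFrame.from_records(index).groupby(0)

# ngroup numbers the roots 0, 1, 2, ...: this picks the colormap per wedge.
maps = g0.ngroup()

# cumcount is each wedge's position within its root; dividing by the largest
# group size scales it into [0, 1), which is the range a colormap accepts.
shades = g0.cumcount() / g0.size().max()

print(maps.tolist())
print(shades.round(2).tolist())
```

In the loop, each (map, shade) pair is then turned into an RGBA color with plt.get_cmap(cmaps[m])(s), so wedges that share a root share a color family and differ only in shade.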
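Note that your sample data maps to a huge number of wedges (outer layer = 199 species), so aggregating the smaller values into "Other" won't really work. The wedges are basically all the same tiny size, so I'm not sure how to label the full sample in a reasonable way.
The full sample is on the left and the smaller subset is on the right: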
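For data where the wedge sizes do vary, one possible way to fold small wedges into "Other" is sketched below. This is not part of the code above: `collapse_small`, its parameters, and the `toy` frame are hypothetical names introduced only for illustration.

```python
import pandas as pd

def collapse_small(wedges: pd.Series, threshold: int, other: str = "Other") -> pd.Series:
    """Hypothetical helper: merge wedges below `threshold` into one `other` wedge per parent."""
    frame = wedges.rename("n").reset_index()
    leaf = frame.columns[-2]                       # column holding the deepest level
    frame.loc[frame["n"] < threshold, leaf] = other
    keys = list(frame.columns[:-1])                # all index levels, parents first
    # sort=False keeps the parents in their original order.
    return frame.groupby(keys, sort=False)["n"].sum()

# Made-up two-level data with a few small wedges.
toy = pd.DataFrame({
    "one": list("AAAAAABBBB"),
    "two": ["AD"] * 4 + ["AE", "AF"] + ["BG"] * 3 + ["BH"],
})
wedges = toy.value_counts().groupby(level=[0, 1]).sum()
collapsed = collapse_small(wedges, threshold=2)
print(collapsed)
```

As written, this only suits the outermost ring: merging wedges that are not adjacent changes their angular positions, so an inner ring collapsed this way would no longer line up with the ring outside it. A parent with a single small child also just gets that child relabeled rather than made any larger.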
For reference, these are the outputs of df -> df.value_counts -> groupby.sum.
Original df:
>>> df = pd.DataFrame({'one': list('AAAAAAAAABBBBBBBCCCC'), 'two': list('DDDDDDEEEFFFGGGGHHII'), 'three': list('JJJKKLLMMMMNNNNNNNNN'), 'four': list('OOPPPPQQRSTTTUUUUVVV'), 'five': list('WWWXXXXXXYYYYYYZZZZZ')}).cumsum(1)
>>> df
   one two three  four   five
0    A  AD   ADJ  ADJO  ADJOW
1    A  AD   ADJ  ADJO  ADJOW
2    A  AD   ADJ  ADJP  ADJPW
3    A  AD   ADK  ADKP  ADKPX
4    A  AD   ADK  ADKP  ADKPX
5    A  AD   ADL  ADLP  ADLPX
6    A  AE   AEL  AELQ  AELQX
7    A  AE   AEM  AEMQ  AEMQX
8    A  AE   AEM  AEMR  AEMRX
9    B  BF   BFM  BFMS  BFMSY
10   B  BF   BFM  BFMT  BFMTY
11   B  BF   BFN  BFNT  BFNTY
12   B  BG   BGN  BGNT  BGNTY
13   B  BG   BGN  BGNU  BGNUY
14   B  BG   BGN  BGNU  BGNUY
15   B  BG   BGN  BGNU  BGNUZ
16   C  CH   CHN  CHNU  CHNUZ
17   C  CH   CHN  CHNV  CHNVZ
18   C  CI   CIN  CINV  CINVZ
19   C  CI   CIN  CINV  CINVZ
MultiIndex from df.value_counts:
>>> counts = df.value_counts()
>>> counts
one  two  three  four  five
A    AD   ADJ    ADJO  ADJOW    2
          ADK    ADKP  ADKPX    2
B    BG   BGN    BGNU  BGNUY    2
C    CI   CIN    CINV  CINVZ    2
A    AD   ADJ    ADJP  ADJPW    1
          ADL    ADLP  ADLPX    1
     AE   AEL    AELQ  AELQX    1
          AEM    AEMQ  AEMQX    1
                 AEMR  AEMRX    1
B    BF   BFM    BFMS  BFMSY    1
                 BFMT  BFMTY    1
          BFN    BFNT  BFNTY    1
     BG   BGN    BGNT  BGNTY    1
                 BGNU  BGNUZ    1
C    CH   CHN    CHNU  CHNUZ    1
                 CHNV  CHNVZ    1
Wedge totals from groupby.sum:
>>> counts.groupby(level=[0]).sum()
one
A    9
B    7
C    4
>>> counts.groupby(level=[0, 1]).sum()
one  two
A    AD     6
     AE     3
B    BF     3
     BG     4
C    CH     2
     CI     2
>>> counts.groupby(level=[0, 1, 2]).sum()
one  two  three
A    AD   ADJ      3
          ADK      2
          ADL      1
     AE   AEL      1
          AEM      2
B    BF   BFM      2
          BFN      1
     BG   BGN      4
C    CH   CHN      2
     CI   CIN      2
>>> counts.groupby(level=[0, 1, 2, 3]).sum()
one  two  three  four
A    AD   ADJ    ADJO    2
                 ADJP    1
          ADK    ADKP    2
          ADL    ADLP    1
     AE   AEL    AELQ    1
          AEM    AEMQ    1
                 AEMR    1
B    BF   BFM    BFMS    1
                 BFMT    1
          BFN    BFNT    1
     BG   BGN    BGNT    1
                 BGNU    3
C    CH   CHN    CHNU    1
                 CHNV    1
     CI   CIN    CINV    2