How do I make a Sankey diagram with Plotly in which one branch stops after a single level?
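I want to make a Sankey diagram that splits into several levels (obviously), but one of the branches should stop after the first level, because the further steps do not apply to it. Something like this: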
import pandas as pd

pd.DataFrame({
    'kind': ['not an animal', 'animal', 'animal', 'animal', 'animal'],
    'animal': ['?', 'cat', 'cat', 'dog', 'cat'],
    'sex': ['?', 'female', 'female', 'male', 'male'],
    'status': ['?', 'domesticated', 'domesticated', 'wild', 'domesticated'],
    'count': [8, 10, 11, 14, 6]
})
            kind  animal     sex        status  count
0  not an animal       ?       ?             ?      8
1         animal     cat  female  domesticated     10
2         animal     cat  female  domesticated     11
3         animal     dog    male          wild     14
4         animal     cat    male  domesticated      6
The "not an animal" rows should not be split any further, because the later levels do not apply to them. It should look like this:
- reuse the structure I used in this answer
- restructure the dataframe from the question as:
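| | source | target | count |
|---|---|---|---|
| 0 | animal | cat | 27 |
| 1 | animal | dog | 14 |
| 2 | cat | female | 21 |
| 3 | cat | male | 6 |
| 4 | dog | male | 14 |
| 5 | female | domesticated | 21 |
| 6 | male | domesticated | 6 |
| 7 | male | wild | 14 |
| 8 | not an animal | ? | 8 |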
- after that, it is just a matter of building the node and link arrays
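To make the last step concrete, here is a minimal sketch of how the node and link arrays could be derived from the link table above using only plain Python. The names `links`, `labels` and `index_of` are illustrative and not part of the answer's code; the table values are hard-coded so the snippet runs on its own.

```python
# Link table from above, hard-coded so this sketch is self-contained.
links = [
    ("animal", "cat", 27),
    ("animal", "dog", 14),
    ("cat", "female", 21),
    ("cat", "male", 6),
    ("dog", "male", 14),
    ("female", "domesticated", 21),
    ("male", "domesticated", 6),
    ("male", "wild", 14),
    ("not an animal", "?", 8),
]

# Every distinct label becomes one node; sorting keeps the numbering stable.
labels = sorted({name for s, t, _ in links for name in (s, t)})
index_of = {label: i for i, label in enumerate(labels)}

# A Sankey trace refers to nodes by their position in the label list,
# so each link is expressed as a pair of integer positions plus a value.
source = [index_of[s] for s, _, _ in links]
target = [index_of[t] for _, t, _ in links]
value = [v for _, _, v in links]

print(labels)
print(source)
print(target)
```

These three lists have the same shape as what the full code passes to `link` as `source`, `target` and `value`, with `labels` playing the role of the node `label`.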
Full code
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import io

# columns are separated by two or more spaces, so "not an animal" stays one field
df2 = pd.read_csv(
    io.StringIO(
        """   kind           animal  sex     status        count
0  not an animal  ?       ?       ?             8
1  animal         cat     female  domesticated  10
2  animal         cat     female  domesticated  11
3  animal         dog     male    wild          14
4  animal         cat     male    domesticated  6"""
    ),
    sep=r"\s\s+",  # raw string avoids an invalid-escape warning
    engine="python",
)

# pair each level with the next one, drop rows whose source is "?", then aggregate
df = (
    pd.concat(
        [
            df2.loc[:, [c1, c2] + ["count"]].rename(
                columns={c1: "source", c2: "target"}
            )
            for c1, c2 in zip(df2.columns[:-1], df2.columns[1:-1])
        ]
    )
    .loc[lambda d: ~d["source"].eq("?")]
    .groupby(["source", "target"], as_index=False)
    .sum()
)

# map every distinct label to an integer node id
nodes = np.unique(df[["source", "target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))

fig = go.Figure(
    go.Sankey(
        node={"label": nodes.index},
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["count"],
        },
    )
)
fig.show()
Building the dataframe in stages
col_pairs = [[c1, c2] for c1, c2 in zip(df2.columns[:-1], df2.columns[1:-1])]

# reconstruct as source / target pairs
df = pd.concat(
    [
        df2.loc[:, cols + ["count"]].rename(
            columns={cols[0]: "source", cols[1]: "target"}
        )
        for cols in col_pairs
    ]
)

# filter out where source is unknown
df = df.loc[~df["source"].eq("?")]

# aggregate to limit links in sankey
df = df.groupby(["source", "target"], as_index=False).sum()
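As a quick sanity check on the aggregated result, the sketch below compares the flow arriving at each intermediate node with the flow leaving it. It is not part of the original answer; the link table is hard-coded from the restructured dataframe shown earlier, and the names `links`, `inflow`, `outflow` and `balance` are illustrative.

```python
import pandas as pd

# Aggregated link table, hard-coded so the check runs on its own.
links = pd.DataFrame(
    {
        "source": ["animal", "animal", "cat", "cat", "dog",
                   "female", "male", "male", "not an animal"],
        "target": ["cat", "dog", "female", "male", "male",
                   "domesticated", "domesticated", "wild", "?"],
        "count": [27, 14, 21, 6, 14, 21, 6, 14, 8],
    }
)

# Total flow arriving at and leaving each node.
inflow = links.groupby("target")["count"].sum()
outflow = links.groupby("source")["count"].sum()

# Only nodes that appear on both sides of some link can be compared.
middle = inflow.index.intersection(outflow.index)
balance = pd.DataFrame({"in": inflow[middle], "out": outflow[middle]})
print(balance)
```

For this data the two columns should agree for `cat`, `dog`, `female` and `male`. The "not an animal" node only ever appears as a source, with its single link going to "?", which is what makes that branch stop after one level.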