Seaborn scatterplot matrix - adding extra points with custom styles
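I'm running k-means clustering on activity from some open source projects on GitHub, and I'm trying to plot the results together with the cluster centroids using a Seaborn scatterplot matrix.
I can successfully plot the results of the clustering analysis (example TSV output below):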
user_id issue_comments issues_created pull_request_review_comments pull_requests category
1 0.14936519790888722 2.0100502512562812 0.0 0.60790273556231 Group 0
1882 0.11202389843166542 0.5025125628140703 0.0 0.0 Group 1
2 2.315160567587752 20.603015075376884 0.13297872340425532 1.21580547112462 Group 2
1789 36.8185212845407 82.91457286432161 75.66489361702128 74.46808510638297 Group 3
The problem is that I would also like to plot the cluster centroids on the matrix plot. Currently my plotting script looks like this:
import seaborn as sns
import pandas as pd
from pylab import savefig
sns.set()
# Use the first column (user_id) as the index so it is not plotted.
# pd.DataFrame.from_csv was removed in pandas 1.0; read_csv replaces it.
data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)
grid = sns.pairplot(data, hue="category", diag_kind="kde")
savefig('normalised_clusters.png', dpi = 150)
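This produces the expected output:
I'd like to be able to mark the cluster centroids on each of these plots. I can think of two ways to do this:
1. Create a new 'CENTROID' category and plot it together with the other points.
2. Manually add extra points to the plots after calling
sns.pairplot(data, hue="category", diag_kind="kde")
If (1) is the solution, then I'd like to be able to customise the marker (perhaps a star?) to make it more prominent.
If (2), I'm all ears. I'm pretty new to Seaborn and Matplotlib, so any assistance would be very welcome :-)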
pairplot isn't all that well suited to this sort of thing, but it's possible to make it work with a few tricks. Here's what I would do.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sns.set_color_codes()
# Make some random iid data
cov = np.eye(3)
ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
                np.random.multivariate_normal([1, 1, 1], cov, 50)])
ds = pd.DataFrame(ds, columns=["x", "y", "z"])
# Fit the k means model and label the observations
km = KMeans(2).fit(ds)
ds["label"] = km.labels_.astype(str)
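Now the non-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations, using the label column to identify the centroids appropriately: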
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)
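If the fitted model is no longer at hand and only a labelled table is available (as with the TSV in the question), the centroid dataframe can probably be rebuilt from the labels, since a converged k-means centroid is approximately the mean of the points assigned to it. A minimal sketch under that assumption follows; centroids_from_labels is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the labelled table; the values are made up for illustration.
data = pd.DataFrame({
    "issue_comments": [0.1, 0.3, 2.0, 2.4],
    "pull_requests": [0.0, 0.2, 1.0, 1.4],
    "category": ["Group 0", "Group 0", "Group 1", "Group 1"],
})


def centroids_from_labels(df, label_col):
    """Return one row per cluster holding the mean of each numeric feature."""
    centroids = df.groupby(label_col).mean(numeric_only=True).reset_index()
    # Give centroid rows their own label so they can be styled separately
    centroids[label_col] = centroids[label_col].astype(str) + " centroid"
    return centroids


centroids = centroids_from_labels(data, "category")
# Stack observations and centroids into one frame, as in the answer above
full_data = pd.concat([data, centroids], ignore_index=True)
```

The combined frame has one extra row per cluster, labelled "Group 0 centroid", "Group 1 centroid" and so on, which is the shape the plotting step expects.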
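Then you just need to use PairGrid, which is a bit more flexible than pairplot and lets you map other plot attributes by the hue variable along with the color (at the expense of not being able to draw histograms on the diagonals):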
g = sns.PairGrid(full_ds, hue="label",
                 hue_order=["0", "1", "0 centroid", "1 centroid"],
                 palette=["b", "r", "b", "r"],
                 hue_kws={"s": [20, 20, 500, 500],
                          "marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
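Another solution is to plot the observations as normal, then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.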
# Plot the data
g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])
# Change the PairGrid dataset and add a new layer
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
g.data = centroids
g.hue_vals = [0, 1]
g.map_offdiag(plt.scatter, s=500, marker="*")
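I know I'm a bit late to the party, but here is a generalized version of mwaskom's code that works with n clusters. It might save someone a few minutes.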
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def cluster_scatter_matrix(data_norm, cluster_number):
    sns.set_color_codes()
    # Record the feature columns before a label column is added; otherwise
    # the centroid frame would get one column name too many
    features = list(data_norm.columns)
    km = KMeans(cluster_number).fit(data_norm)
    # Label a copy so the caller's DataFrame is left unchanged
    labelled = data_norm.copy()
    labelled["label"] = km.labels_.astype(str)
    centroids = pd.DataFrame(km.cluster_centers_, columns=features)
    centroids["label"] = [str(n) + " centroid" for n in range(cluster_number)]
    full_ds = pd.concat([labelled, centroids], ignore_index=True)
    clusters = [str(n) for n in range(cluster_number)]
    g = sns.PairGrid(full_ds, hue="label",
                     hue_order=clusters + [c + " centroid" for c in clusters],
                     # palette=["b", "r", "b", "r"],
                     hue_kws={"s": [20] * cluster_number + [500] * cluster_number,
                              "marker": ["o"] * cluster_number + ["*"] * cluster_number})
    g.map(plt.scatter, linewidth=1, edgecolor="w")
    g.add_legend()
    return g
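With the palette line commented out, each centroid gets its own default color rather than the color of its cluster. One possible way to keep them matched for any number of clusters is to build the palette by repeating a list of n colors, as sketched below. pairgrid_hue_settings is a hypothetical helper and "husl" is just an assumed choice of palette.

```python
import seaborn as sns


def pairgrid_hue_settings(cluster_number, palette_name="husl"):
    """Build matching hue_order, palette and hue_kws lists for n clusters."""
    clusters = [str(n) for n in range(cluster_number)]
    colors = list(sns.color_palette(palette_name, cluster_number))
    return {
        # Observations first, then centroids, in the same cluster order
        "hue_order": clusters + [c + " centroid" for c in clusters],
        # Repeat the colors so centroid i shares the color of cluster i
        "palette": colors + colors,
        "hue_kws": {
            "s": [20] * cluster_number + [500] * cluster_number,
            "marker": ["o"] * cluster_number + ["*"] * cluster_number,
        },
    }


settings = pairgrid_hue_settings(3)
```

The result could then be unpacked into the grid constructor, for example sns.PairGrid(full_ds, hue="label", **settings).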