Single linkage clustering of an edit distance matrix with a distance threshold stopping criterion
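I am trying to assign flat, single-linkage clusters to sequence IDs separated by an edit distance < n, given a square distance matrix. I believe scipy.cluster.hierarchy.fclusterdata() with criterion='distance' may be a way to do this, but it isn't quite returning the clusters I'd expect for this toy example.
Specifically, in the 4x4 distance matrix example below, I would expect clusters_50 (which uses t=50) to create 2 clusters, whereas it actually finds 3. I think the issue is that fclusterdata() doesn't expect a distance matrix, but fcluster() doesn't appear to do what I want either.
I have also looked at sklearn.cluster.AgglomerativeClustering, but this requires n_clusters to be specified, and I want to create as many clusters as needed until the distance threshold I specify has been met.
I see that there is a currently unmerged scikit-learn pull request for this exact feature: https://github.com/scikit-learn/scikit-learn/pull/9069
Can anyone point me in the right direction? Clustering with an absolute distance threshold criterion seems like a common use case.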
import pandas as pd
from scipy.cluster.hierarchy import fclusterdata

cols = ['a', 'b', 'c', 'd']
df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
                   {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
                   {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
                   {'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
                  index=cols)

clusters_20 = fclusterdata(df.values, t=20, criterion='distance')
clusters_50 = fclusterdata(df.values, t=50, criterion='distance')
clusters_100 = fclusterdata(df.values, t=100, criterion='distance')

names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}

names_clusters_20  # Expecting 3 clusters, finds 3
# >>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}
names_clusters_50  # Expecting 2 clusters, finds 3
# >>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}
names_clusters_100  # Expecting 2 clusters, finds 2
# >>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}
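I solved this by passing the output of linkage() to fcluster(). Unlike fclusterdata(), linkage() accepts a precomputed distance matrix, provided it is in condensed form. Note that, per the SciPy documentation, the metric argument is ignored when a condensed matrix is passed, so metric='precomputed' here documents intent rather than selecting a mode.
fcluster(linkage(condensed_dm, metric='precomputed'), criterion='distance', t=20)
Solution: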
import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

cols = ['a', 'b', 'c', 'd']
df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
                   {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
                   {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
                   {'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
                  index=cols)

# Convert the square matrix to the condensed (upper-triangle) form linkage() expects
dm_cnd = squareform(df.values)

clusters_20 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=20)
clusters_50 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=50)
clusters_100 = fcluster(linkage(dm_cnd, metric='precomputed'), criterion='distance', t=100)

names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}

names_clusters_20
# >>> {'a': 1, 'b': 3, 'c': 2, 'd': 1}
names_clusters_50
# >>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}
names_clusters_100
# >>> {'a': 1, 'b': 2, 'c': 1, 'd': 1}
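To see why the thresholds behave this way, it can help to look at the merge distances stored in the linkage matrix itself. The following is a minimal sketch, assuming the same toy matrix as in the question; the names dm, condensed, merge_heights and n_clusters are illustrative only and do not come from the original post.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Same toy matrix as in the question, rows/columns in the order a, b, c, d.
dm = np.array([[0, 29467, 35, 13],
               [29467, 0, 29468, 29470],
               [35, 29468, 0, 38],
               [13, 29470, 38, 0]], dtype=float)

# Condensed form: the upper triangle read row by row.
condensed = squareform(dm)

# Each row of Z describes one merge:
# [cluster index, cluster index, merge distance, size of the new cluster].
Z = linkage(condensed, method='single')
merge_heights = Z[:, 2]

# Number of flat clusters obtained at each threshold.
n_clusters = {t: len(set(fcluster(Z, t=t, criterion='distance')))
              for t in (20, 50, 100)}

print(merge_heights)
print(n_clusters)
```

Under single linkage, a and d merge at 13, c joins them at 35, and b joins last at 29467. Any threshold from 13 up to just below 35 therefore gives three clusters, and any threshold from 35 up to just below 29467 gives two.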
As a function:
import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_df(df, method='single', threshold=100):
    '''
    Accepts a square distance matrix as a DataFrame whose index and columns carry
    the same labels, and returns a dict of flat cluster ids keyed by column label.
    Performs single linkage clustering by default; see the
    scipy.cluster.hierarchy.linkage docs for other methods.
    '''
    # linkage() expects the condensed (upper-triangle) form of the matrix
    dm_cnd = squareform(df.values)
    clusters = fcluster(linkage(dm_cnd,
                                method=method,
                                metric='precomputed'),
                        criterion='distance',
                        t=threshold)
    names_clusters = {s: c for s, c in zip(df.columns, clusters)}
    return names_clusters
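As a cross-check that does not depend on SciPy: for single linkage, cutting the tree at distance t should give the same partition as taking the connected components of the graph that links every pair with distance <= t. The sketch below assumes that equivalence and uses a small union-find; threshold_components, names and dist are hypothetical names introduced for illustration, not part of any library.

```python
def threshold_components(names, dist, t):
    """Label each name with the connected component it falls in when
    every pair with dist[u][v] <= t is joined by an edge."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union every pair whose distance is within the threshold.
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if dist[u][v] <= t:
                parent[find(u)] = find(v)

    # Number the components 1, 2, ... in order of first appearance.
    root_label = {}
    labels = {}
    for n in names:
        root = find(n)
        root_label.setdefault(root, len(root_label) + 1)
        labels[n] = root_label[root]
    return labels


names = ['a', 'b', 'c', 'd']
dist = {'a': {'a': 0, 'b': 29467, 'c': 35, 'd': 13},
        'b': {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
        'c': {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
        'd': {'a': 13, 'b': 29470, 'c': 38, 'd': 0}}

print(threshold_components(names, dist, 20))
print(threshold_components(names, dist, 50))
```

The cluster numbering may differ from what fcluster() returns, but the groupings should match: a and d together at t=20, and a, c and d together at t=50. Because the comparison is <= t, which matches the documented behaviour of criterion='distance', integer edit distances strictly below n would correspond to t = n - 1.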
You did not set the metric parameter, so it defaults to metric='euclidean' and the input is not treated as precomputed distances.
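To make that concrete, the sketch below treats each row of the toy matrix as an observation vector and computes the Euclidean distance between rows, which is what a default metric='euclidean' implies. It uses only the standard library (math.dist, Python 3.8+); the names rows and euclid are illustrative assumptions.

```python
import math
from itertools import combinations

names = ['a', 'b', 'c', 'd']

# Rows of the square matrix, read as 4-dimensional observation vectors.
rows = {'a': (0, 29467, 35, 13),
        'b': (29467, 0, 29468, 29470),
        'c': (35, 29468, 0, 38),
        'd': (13, 29470, 38, 0)}

# Euclidean distance between every pair of rows.
euclid = {(u, v): math.dist(rows[u], rows[v])
          for u, v in combinations(names, 2)}

for pair, d in euclid.items():
    print(pair, round(d, 2))
```

The row-to-row distance between a and d is about 18.87, while a to c is about 55.46 and c to d about 58.10. With those values, c cannot join the a/d cluster at t=50, which is consistent with the three clusters reported in the question, even though the edit distances themselves (35 and 38) are below 50.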