How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

Question

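I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points whose cosine similarity is close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).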

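The issue:

eps is the maximum distance between two samples for DBSCAN to consider them as being in the same neighbourhood - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;

but

sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1, and I want DBSCAN to consider two points to be neighbours if the "distance" between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.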

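I see two possible solutions:

1. pass a range of values to the eps parameter of DBSCAN, e.g. eps=[0.75, 1]
2. pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarity matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity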

I do not know how to implement either of these.

Any guidance would be appreciated!

Answer 1

DBSCAN has a metric keyword argument. Docstring:

metric : string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.

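So the easiest way is probably to precompute a distance matrix using cosine similarity as your distance measure, preprocess the distance matrix so that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) - 1), where CD is your cosine distance matrix), then set metric to precomputed and pass the precomputed distance matrix D in for X, i.e. the data.

For example: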

#!/usr/bin/env python

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)

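# despite the name, this holds cosine similarities in [-1, 1], not distances
cosine_distance = cosine_similarity(points)

# pick ONE of the two options below; as written, option 2 overwrites option 1

# option 1) vectors are close to each other if they are parallel
# (pointing in the same or in exactly opposite directions)
bespoke_distance = np.abs(np.abs(cosine_distance) - 1)

# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_distance - 1)

# for option 2, eps=0.25 corresponds to a cosine similarity of at least 0.75
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)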
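To make the difference between option 1 and option 2 concrete, here is a minimal sketch on four hand-picked 2-D vectors. The data, the helper name cluster, and min_samples=2 are assumptions chosen for illustration; they are not part of the answer above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

# Four hand-picked vectors (illustrative data only)
points = np.array([
    [1.0, 0.0],    # reference direction
    [2.0, 0.1],    # almost the same direction as the first vector
    [-1.0, 0.0],   # exactly opposite to the first vector
    [0.0, 1.0],    # orthogonal to the first vector
])

similarity = cosine_similarity(points)

# option 1: parallel OR anti-parallel vectors count as close
dist_parallel = np.abs(np.abs(similarity) - 1)
# option 2: only vectors pointing the same way count as close
dist_same_direction = np.abs(similarity - 1)


def cluster(distance):
    """Run DBSCAN on a precomputed distance matrix (hypothetical helper)."""
    # min_samples=2 so that a single pair of neighbours can form a cluster
    model = DBSCAN(metric="precomputed", eps=0.25, min_samples=2)
    return model.fit(distance).labels_


labels_parallel = cluster(dist_parallel)
labels_same_direction = cluster(dist_same_direction)
print("option 1:", labels_parallel)
print("option 2:", labels_same_direction)
```

With this data, option 1 should put the first three vectors in one cluster (the opposite vector counts as parallel), while option 2 should group only the first two and mark the other two as noise (label -1).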

Answer 2

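A) Check out Generalized DBSCAN, which works fine with similarities, too. With cosine, sklearn will be slow anyway.

B) You can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).

C) You can probably even pass -cosinesimilarity as a precomputed distance matrix and use -0.75 as eps.

D) Just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.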
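Options B and D can be sketched side by side as follows. The random data, the fixed seed, and the variable names are assumptions made for illustration. The sketch uses >= for the threshold, because DBSCAN treats a distance equal to eps as being inside the neighbourhood.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)           # fixed seed, illustrative data
points = rng.random((200, 3)) - 0.5      # directions spread around the origin
threshold = 0.75                         # similarity threshold from the question

similarity = cosine_similarity(points)

# B) cosine distance = 1 - cosine similarity
# (clip removes tiny negative values caused by floating-point rounding)
cosine_distance = np.clip(1.0 - similarity, 0.0, None)
labels_b = DBSCAN(metric="precomputed", eps=1.0 - threshold).fit(cosine_distance).labels_

# B, shorter) let scikit-learn compute the cosine distance itself
labels_direct = DBSCAN(metric="cosine", eps=1.0 - threshold).fit(points).labels_

# D) binary distance: 0 if the similarity reaches the threshold, 1 otherwise
binary_distance = np.where(similarity >= threshold, 0.0, 1.0)
labels_d = DBSCAN(metric="precomputed", eps=0.5).fit(binary_distance).labels_

print(np.array_equal(labels_b, labels_d))
print(np.array_equal(labels_b, labels_direct))
```

All three variants define the same neighbourhoods, so they are expected to yield the same labels. Every one of them still builds or scans an n-by-n matrix, which is the O(n²) cost mentioned above.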

Answer 3

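A few options:

1. dist = np.abs(cos_sim - 1) (the accepted answer here)
2. dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
3. dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178

I have found that they all work the same in practice for this application (precomputed distances in hierarchical clustering; I had hit the same snag). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.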
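One way to see why the three transforms behave alike for a neighbourhood-based method such as DBSCAN: each is a monotonically decreasing function of the cosine similarity, so a similarity threshold maps to a matching eps for each of them. The sketch below assumes random illustrative data and a similarity threshold of 0.75; the dictionary layout is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)           # fixed seed, illustrative data
points = rng.random((200, 3)) - 0.5
threshold = 0.75                         # assumed similarity threshold

# Rounding can push values slightly outside [-1, 1], where arccos returns nan
cos_sim = np.clip(cosine_similarity(points), -1.0, 1.0)

# Each entry: (distance matrix, eps equivalent to similarity >= threshold)
candidates = {
    "abs(sim - 1)": (np.abs(cos_sim - 1), 1 - threshold),
    "arccos(sim) / pi": (np.arccos(cos_sim) / np.pi, np.arccos(threshold) / np.pi),
    "1 - (sim + 1) / 2": (1 - (cos_sim + 1) / 2, (1 - threshold) / 2),
}

labels = {}
for name, (dist, eps) in candidates.items():
    labels[name] = DBSCAN(metric="precomputed", eps=eps).fit(dist).labels_
    n_clusters = len(set(labels[name])) - (1 if -1 in labels[name] else 0)
    print(f"{name}: eps={eps:.4f}, clusters={n_clusters}")
```

With matched eps values the three runs are expected to give the same labels. The choice matters more for methods that use the distance values themselves, such as hierarchical clustering with average linkage.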
