Why is DBSCAN.fit() faster with more features?
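I'm playing around with DBSCAN. I'd like to know why the execution time decreases as the number of features increases (see the plot generated by the code below). I expected the execution time to increase with the number of features...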
import timeit
import functools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import DBSCAN

features = [2, 4, 8, 10]
training_examples = [100, 500, 1000, 2000]
n_iterations = 10
x = np.asarray(training_examples)

for num_features in features:
    average_execution_time = []
    for num_training_examples in training_examples:
        # generate matrix of random training examples
        X = np.random.rand(num_training_examples, num_features)
        # generate a symmetric distance matrix
        D = euclidean_distances(X, X)
        # DBSCAN parameters
        eps = 0.5
        kmedian_thresh = 0.005
        min_samples = 5
        db = DBSCAN(eps=eps,
                    min_samples=min_samples,
                    metric='precomputed')
        # Call timeit
        t = timeit.Timer(functools.partial(db.fit, D))
        average_execution_time.append(t.timeit(n_iterations) / n_iterations)
    # one curve per feature count
    y = np.asarray(average_execution_time)
    plt.plot(x, y, label='{} features'.format(num_features))

plt.xlabel('No. of Training Examples')
plt.ylabel('DBSCAN.fit() time to Cluster')
plt.title('DBSCAN.fit() avg time to Cluster')
plt.legend()
plt.grid()
plt.show()
The DBSCAN algorithm basically requires two parameters:
eps: specifies how close points should be to each other to be considered part of a cluster. If the distance between two points is less than or equal to this value (eps), the points are considered neighbors.
minPoints (min_samples in scikit-learn): the minimum number of points needed to form a dense region. For example, if we set minPoints to 5, then we need at least 5 points to form a dense region.
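The two definitions above can be sketched in a few lines. This is only an illustration, not scikit-learn's implementation: the helper names `eps_neighborhood` and `is_core_point` and the toy data are hypothetical, and the sketch assumes a point counts as a member of its own neighborhood.

```python
import math

def eps_neighborhood(points, index, eps):
    """Indices of all points within eps of points[index], the point itself included."""
    return [j for j, other in enumerate(points)
            if math.dist(points[index], other) <= eps]

def is_core_point(points, index, eps, min_points):
    """A point lies in a dense region if its eps-neighborhood holds at least min_points points."""
    return len(eps_neighborhood(points, index, eps)) >= min_points

# hypothetical toy data: four points packed together and one far away
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
core_flags = [is_core_point(toy_points, i, eps=0.5, min_points=4)
              for i in range(len(toy_points))]
print(core_flags)
```

The four packed points each see all four of them within eps, so they qualify; the isolated point only sees itself and does not.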
I think your problem is related to these two parameters.
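eps: if the chosen eps value is too small, a large part of the data will not be clustered; those points are treated as outliers because they do not meet the point count needed to form a dense region. If the value is too high, clusters merge and most objects end up in the same cluster. eps should be chosen based on the distances in the dataset (a k-distance graph can be used to find it), but in general smaller eps values are preferable. For run time, what matters is how many points fall within eps of each other: your benchmark keeps eps = 0.5 while the typical distance between uniform random points grows with the number of features, so neighborhoods shrink and fit() has less work to do.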
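minPoints: as a rule of thumb, a minimum minPoints can be derived from the number of dimensions D in the dataset, as minPoints ≥ D + 1. Larger values are usually better for noisy datasets and yield more significant clusters. minPoints must be at least 3, and the larger the dataset, the larger the value that should be chosen. Basically, larger = faster, since fewer points qualify as dense and there is less to expand.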
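To see how a fixed eps interacts with the number of features, one can count how many neighbors a point has on average within eps = 0.5 for uniform random data, as in the question. This is a rough sketch using only the standard library; the function name `mean_neighbors`, the sample size of 300, and the seed are arbitrary choices made for illustration.

```python
import math
import random

def mean_neighbors(num_points, num_features, eps, seed=0):
    """Average number of other points within eps of a point, for uniform random data."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(num_features)]
              for _ in range(num_points)]
    pairs_within_eps = 0
    for i in range(num_points):
        for j in range(i + 1, num_points):
            if math.dist(points[i], points[j]) <= eps:
                pairs_within_eps += 1
    # each close pair contributes one neighbor to both of its points
    return 2 * pairs_within_eps / num_points

neighbor_counts = {d: mean_neighbors(300, d, eps=0.5) for d in (2, 4, 8, 10)}
for d, count in neighbor_counts.items():
    print(f"{d:>2} features: about {count:.1f} neighbors within eps=0.5")
```

The average should fall sharply as features are added. Once it drops below min_samples, hardly any point can start a dense region, which is consistent with fit() finishing sooner on the higher-dimensional data.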