sklearn DBSCAN to cluster GPS positions with big epsilon
I want to use DBSCAN from sklearn to find clusters in my GPS positions. I don't understand why the coordinate [ 18.28, 57.63] (lower right corner in the figure) is clustered together with the other coordinates to the left. Could there be a problem with using a big epsilon? sklearn version 0.19.0.
To reproduce this:
I copied the demo code from here: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html but I replaced the sample data with a few coordinates (see variable X in the code below). I got the inspiration from here: http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
import numpy as np
from sklearn.cluster import DBSCAN
# #############################################################################
# Generate sample data
X = np.array([[ 11.95, 57.70],
[ 16.28, 57.63],
[ 16.27, 57.63],
[ 16.28, 57.66],
[ 11.95, 57.63],
[ 12.95, 57.63],
[ 18.28, 57.63],
[ 11.97, 57.70]])
# #############################################################################
# Compute DBSCAN
kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian
db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree', metric='haversine').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples are drawn with large markers.
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # Non-core samples (border points and noise) are drawn with small markers.
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
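The haversine metric requires data in radians
I recently made the same mistake (using hdbscan), and it was the cause of some "strange" results. For example, the same point would sometimes be included in a cluster and sometimes be flagged as noise. "How can this be?", I kept wondering. It turned out that I was passing lat/lon directly, without converting to radians first.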
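To see how much the units matter, here is a minimal sketch of the haversine formula in plain Python. The function name `haversine_km` and the choice of two points are mine, for illustration only, and I am assuming the question's values are longitude first, so I read them here as (latitude, longitude). The same pair of points is measured twice: once with the raw degree values passed as if they were radians, and once after conversion.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # same value as kms_per_radian in the question


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. All four arguments must be in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two points from the question, read as (latitude, longitude) in degrees.
p = (57.63, 16.28)
q = (57.63, 18.28)

# Wrong: degree values fed straight into a formula that expects radians.
wrong = haversine_km(*p, *q)

# Right: convert every coordinate to radians first.
right = haversine_km(*map(math.radians, p), *map(math.radians, q))

print("degrees passed as radians: %.1f km" % wrong)
print("converted to radians:      %.1f km" % right)
```

The converted call gives a distance of roughly 119 km for two points 2 degrees of longitude apart at this latitude, while the unconverted call is off by thousands of kilometres. Because the trigonometric functions are periodic, the error is not a simple scale factor, which would explain the erratic cluster assignments.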
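The OP's self-answer is correct but short on detail. One could, of course, multiply the lat/lon values by π/180, but if you are already using numpy, the simplest fix is to change this line of the original code: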
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(X)
to:
db = DBSCAN(eps=epsilon, ... metric='haversine').fit(np.radians(X))
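For completeness, here is a sketch of the clustering step with the conversion applied. One extra point that is my own assumption rather than something stated in the question: the sample values look like [longitude, latitude] (57.6 to 57.7 is the latitude of that region), whereas scikit-learn's haversine metric treats the first column as latitude and the second as longitude. The sketch therefore reverses the columns before calling `np.radians`; if your data is already [lat, lon], drop the `[:, ::-1]`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sample data from the question, assumed to be [longitude, latitude] in degrees.
X = np.array([[11.95, 57.70],
              [16.28, 57.63],
              [16.27, 57.63],
              [16.28, 57.66],
              [11.95, 57.63],
              [12.95, 57.63],
              [18.28, 57.63],
              [11.97, 57.70]])

kms_per_radian = 6371.0088
epsilon = 400 / kms_per_radian  # 400 km expressed as an angle in radians

# Assumption: reorder to [latitude, longitude], then convert degrees to radians.
X_rad = np.radians(X[:, ::-1])

db = DBSCAN(eps=epsilon, min_samples=2, algorithm='ball_tree',
            metric='haversine').fit(X_rad)
labels = db.labels_

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print(labels)
```

Note that with this particular sample and a 400 km radius, all eight points should still land in a single cluster, because by my calculation even the farthest pair is less than 400 km apart. To separate the western and eastern groups you would need a smaller `eps`.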