Clustering a large dataset using Dask

I have installed Dask. My main aim is to cluster a large dataset, but before starting on it I want to run a few tests. However, whenever I run a piece of Dask code, it takes too long and ends with a memory error. I tried their Spectral Clustering example and the short code below.

What do you think the problem is?


from dask.distributed import Client
from joblib import parallel_backend  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

import datetime

# 150,000 points in 2D, drawn from 3 Gaussian blobs
X, y = make_blobs(n_samples=150000, n_features=2, centers=3, cluster_std=2.1)
client = Client()  # start a local Dask cluster

now = datetime.datetime.now()
model = DBSCAN(eps=0.5, min_samples=30)
with parallel_backend('dask'):  # route joblib tasks to the Dask workers
    model.fit(X)
print(datetime.datetime.now() - now)

Scikit-Learn algorithms are not designed to train over large datasets. They are designed to operate on data that fits in memory. This is described here: https://ml.dask.org/#parallelize-scikit-learn-directly . The joblib Dask backend only changes where scikit-learn's parallel tasks run; the full dataset still has to fit in the memory of whichever worker executes the fit, which is why you still see a memory error.
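
To make the contrast concrete, here is a minimal sketch of what that page means by parallelizing scikit-learn directly: the Dask joblib backend fans out many small in-memory fits across workers (here a hypothetical grid search over n_clusters, chosen just for illustration), rather than fitting one estimator on data larger than memory.

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.cluster import KMeans

client = Client()
X, _ = make_blobs(n_samples=10000, n_features=2, centers=3)

# Each candidate fit is a separate joblib task, so Dask can run the
# fits in parallel across workers; note that every task still holds
# the whole of X in memory.
search = GridSearchCV(KMeans(), param_grid={'n_clusters': [2, 3, 4, 5]}, cv=3)
with parallel_backend('dask'):
    search.fit(X)
print(search.best_params_)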

Projects like Dask-ML do have other algorithms that look like Scikit-Learn but are implemented differently, with support for larger dataset sizes. If you are looking for clustering, you may be interested in this page to see what is currently supported: https://ml.dask.org/clustering.html
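
For instance, here is a minimal sketch of the Dask-ML route, assuming dask-ml is installed. Its clustering module offers estimators such as KMeans and SpectralClustering (not DBSCAN), which accept chunked Dask arrays so the data never has to be materialized in memory all at once.

from dask_ml.datasets import make_blobs
from dask_ml.cluster import KMeans

# Same kind of blob data as above, but generated as a Dask array
# split into row chunks of 10,000 samples.
X, y = make_blobs(n_samples=150000, n_features=2, centers=3,
                  cluster_std=2.1, chunks=10000)

model = KMeans(n_clusters=3)
model.fit(X)  # trains on the chunked Dask array
print(model.labels_[:10].compute())  # labels_ is itself a lazy Dask array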