How can I cluster SIFT descriptors with Apache Spark kmeans (via pickle or not)

Question

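I have computed SIFT descriptors for a bunch of images using OpenCV 3.1. Each descriptor has shape (x, 128) and is written to disk using the pickle-based .tofile function. In a sample of the images, x is between 2000 and 3000.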

I would like to use Apache Spark's kmeans clustering via pyspark, but my question has two parts.

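I am interested in the sequence of python2 code, assuming there is some common storage between the descriptor-generation code and the clustering environment.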

Answer 1

Is pickling the best way to transfer the descriptor data?

"Best" is very case-specific here. You could try pickle or protobuf.
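To make the trade-off concrete, here is a minimal sketch comparing a raw dump with a pickle of the same descriptor matrix. It assumes the descriptors are float32 with 128 columns; the helper names (save_raw, load_raw, save_pickle, load_pickle) and the random stand-in data are made up for illustration. One thing it shows is that NumPy's ndarray.tofile stores only raw bytes, so the reader has to know the dtype and row length in advance, whereas a pickle carries both with it.

```python
import os
import pickle
import tempfile

import numpy as np

DESCRIPTOR_DIM = 128  # SIFT descriptor length, from the question


def save_raw(desc, path):
    # ndarray.tofile writes only the raw bytes: no shape, no dtype.
    desc.astype(np.float32).tofile(path)


def load_raw(path):
    # The reader has to supply the dtype and the row length itself.
    return np.fromfile(path, dtype=np.float32).reshape(-1, DESCRIPTOR_DIM)


def save_pickle(desc, path):
    with open(path, "wb") as f:
        # Protocol 2 is the highest pickle protocol python2 understands.
        pickle.dump(desc, f, protocol=2)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for one image's descriptors; real ones would come from OpenCV.
demo = np.random.RandomState(0).rand(2500, DESCRIPTOR_DIM).astype(np.float32)
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "img0.raw")
pkl_path = os.path.join(tmp_dir, "img0.pkl")

save_raw(demo, raw_path)
save_pickle(demo, pkl_path)
print(load_raw(raw_path).shape, load_pickle(pkl_path).shape)
```

Either file round-trips the data; the raw file is smaller and language-neutral, while the pickle is more convenient if everything stays in Python.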

How do I get from the bunch of pickle files to a cluster ready dataset?

For example, the LOPQ folks do the following:

from pyspark.mllib.clustering import KMeans
C0 = KMeans.train(first, V, initializationMode='random', maxIterations=10, seed=seed)

where first is the RDD I mentioned, V is the number of clusters, and C0 is the computed clusters (check line 67 on GitHub).
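The call above takes an RDD of vectors, so the remaining step is turning the files on shared storage into such an RDD. A rough sketch of one way to do that follows. It assumes each file holds raw float32 values as written by ndarray.tofile, with 128 values per descriptor; rows_from_bytes, train_kmeans, and descriptor_dir are hypothetical names, not part of LOPQ or Spark. It relies on SparkContext.binaryFiles, which yields one (path, contents) pair per file.

```python
import numpy as np

DESCRIPTOR_DIM = 128  # SIFT descriptor length, from the question


def rows_from_bytes(raw):
    """Turn the raw bytes of one descriptor file into a list of 128-d rows."""
    matrix = np.frombuffer(raw, dtype=np.float32).reshape(-1, DESCRIPTOR_DIM)
    # astype copies the read-only buffer and widens the values to float64.
    return list(matrix.astype(np.float64))


def train_kmeans(sc, descriptor_dir, k, seed=None):
    # Imported here so rows_from_bytes can be tried without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # One (path, file_contents) pair per file; flatMap pools all descriptors.
    first = sc.binaryFiles(descriptor_dir).flatMap(
        lambda kv: rows_from_bytes(kv[1]))
    return KMeans.train(first, k, initializationMode="random",
                        maxIterations=10, seed=seed)
```

With an existing SparkContext sc, something like train_kmeans(sc, descriptor_dir, k=V, seed=seed) would return a KMeansModel whose clusterCenters hold the centroids. If the files were pickled instead, the same shape of pipeline applies with pickle.loads in place of np.frombuffer.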
