Spark MLlib - Create LabeledPoint from RDD[Vector] features and RDD[Vector] labels

Question

I am building a training set from two text files, one holding the documents and the other holding their labels.

Documents.txt

hello world
hello mars

Labels.txt

0
1

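I have read these files and converted my document data into a tf-idf weighted term-document matrix, represented as an RDD[Vector]. I have also read in my labels and created an RDD[Vector] for them: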

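import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val docs: RDD[Seq[String]] = sc.textFile("Documents.txt").map(_.split(" ").toSeq)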
val labs: RDD[Vector] = sc.textFile("Labels.txt")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()

val idf = new IDF(minDocFreq = 3).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

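I would like to use tfidf and labs to create an RDD[LabeledPoint], but I am not sure how to apply a mapping across two different RDDs. Is this even possible/efficient, or do I need to rethink my approach?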

Answer 1

One way to handle this is to join the two RDDs on an index:

import org.apache.spark.RangePartitioner

// Add indices (tfidf and labs are the RDDs from the question)
val idfIndexed = tfidf.zipWithIndex.map(_.swap)
val labelsIndexed = labs.zipWithIndex.map(_.swap)

// Create range partitioner on larger RDD
val partitioner = new RangePartitioner(idfIndexed.partitions.size, idfIndexed)

// Join with custom partitioner; yields (label, features) pairs as RDD[(Vector, Vector)]
labelsIndexed.join(idfIndexed, partitioner).values
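The join above produces (label, features) pairs rather than LabeledPoint values, so one more map is needed to reach the RDD[LabeledPoint] the question asks for. Below is a minimal, self-contained sketch of the whole index-join approach. It makes several assumptions: the small hand-written vectors stand in for the question's tfidf and labs, each label vector holds exactly one value (so label(0) is the class), and a local SparkContext is created only for illustration. The name featuresIndexed is hypothetical, and the custom partitioner is left out for brevity. On Spark versions older than 1.3, `import org.apache.spark.SparkContext._` is also needed for the pair-RDD `join`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Local context for illustration only; in spark-shell, skip these two
// lines and reuse the existing `sc`.
val conf = new SparkConf().setMaster("local[2]").setAppName("labeled-point-sketch")
val sc = new SparkContext(conf)

// Stand-ins for the question's `tfidf` and `labs` (values are made up).
// The partition counts differ on purpose: the join does not require the
// two RDDs to be partitioned the same way.
val tfidf: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(0.0, 1.5),
  Vectors.dense(2.0, 0.0)
), 2)
val labs: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(0.0),
  Vectors.dense(1.0)
), 1)

// Key every element by its position so the two RDDs can be joined.
val featuresIndexed: RDD[(Long, Vector)] = tfidf.zipWithIndex.map(_.swap)
val labelsIndexed: RDD[(Long, Vector)] = labs.zipWithIndex.map(_.swap)

// Join on the index, drop the key, and build the LabeledPoints.
// Assumes one label value per line, hence label(0).
val points: RDD[LabeledPoint] = labelsIndexed
  .join(featuresIndexed)
  .values
  .map { case (label, features) => LabeledPoint(label(0), features) }

// The join does not preserve input order, so sort locally before printing.
val collected: Array[LabeledPoint] = points.collect().sortBy(_.label)
collected.foreach(println)

sc.stop()
```

The index join only pairs rows correctly if line i of Labels.txt really belongs to line i of Documents.txt; zipWithIndex numbers elements by their position within the RDD, which for sc.textFile follows the order of lines in the file.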

scala

classification

apache-spark

apache-spark-mllib