Convert a Spark Scala Dataset to a specific RDD format
I have a dataframe that looks like this:
+--------------------+-----------------+
|     recommendations|relevant_products|
+--------------------+-----------------+
|[12949, 12949, 71...|           [4343]|
|[12949, 12949, 71...|           [1589]|
|[12949, 12949, 71...|          [11497]|
+--------------------+-----------------+
evaluation_ds: org.apache.spark.sql.Dataset[docCompare] = [recommendations: array, relevant_products: array]
This is the case class used by the Dataset: case class docCompare(recommendations: Array[Int], relevant_products: Array[Int])
How do I convert it to a JavaRDD in the following format:
org.apache.spark.rdd.RDD[(Array[?], Array[?])]
You can simply call .rdd on the Dataset, as shown below:
import spark.implicits._  // required for toDF and the tuple encoder

val evaluation_ds = Seq(
  (Seq(3446, 3843, 1809), Seq(1249)),
  (Seq(4557, 4954, 2920), Seq(2360))
).toDF("recommendations", "relevant_products").as[(Array[Int], Array[Int])]
import org.apache.spark.mllib.evaluation.RankingMetrics
val metrics = new RankingMetrics(evaluation_ds.rdd)
// metrics: org.apache.spark.mllib.evaluation.RankingMetrics[Int] = ...
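If you already have the Dataset typed as docCompare rather than as tuples, you can map each row to a pair before calling .rdd, e.g. evaluation_ds.map(d => (d.recommendations, d.relevant_products)).rdd. The sketch below shows that same per-row transformation in plain Scala (no SparkSession needed), using made-up sample values:

```scala
// Sketch of the transformation Dataset.map + .rdd would perform:
// each docCompare row becomes an (Array[Int], Array[Int]) pair,
// which is the element type RankingMetrics expects.
case class docCompare(recommendations: Array[Int], relevant_products: Array[Int])

// Hypothetical sample rows, shaped like the ones in the question
val rows = Seq(
  docCompare(Array(12949, 12949, 71), Array(4343)),
  docCompare(Array(12949, 12949, 71), Array(1589))
)

// The same function you would pass to evaluation_ds.map(...)
val pairs: Seq[(Array[Int], Array[Int])] =
  rows.map(d => (d.recommendations, d.relevant_products))

println(pairs.map { case (r, p) => (r.toList, p.toList) })
```

In Spark, mapping to a tuple of Array[Int] uses the built-in tuple/array encoders, so no extra encoder definitions are needed beyond importing spark.implicits._.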