不能 select 特定列用于 ReduceByKey 操作 Spark

cannot select a specific column for ReduceByKey operation Spark

我创建了一个如下所示的DataFrame,我想对列'title'应用map-reduce算法,但是当我使用reduceByKey函数时,我遇到了一些问题。

+-------+--------------------+------------+-----------+
|project|               title|requests_num|return_size|
+-------+--------------------+------------+-----------+
|     aa|%CE%92%CE%84_%CE%...|           1|       4854|
|     aa|%CE%98%CE%B5%CF%8...|           1|       4917|
|     aa|%CE%9C%CF%89%CE%A...|           1|       4832|
|     aa|%CE%A0%CE%B9%CE%B...|           1|       4828|
|     aa|%CE%A3%CE%A4%CE%8...|           1|       4819|
|     aa|%D0%A1%D0%BE%D0%B...|           1|       4750|
|     aa|             271_a.C|           1|       4675|
|     aa|Battaglia_di_Qade...|           1|       4765|
|     aa|    Category:User_th|           1|       4770|
|     aa|  Chiron_Elias_Krase|           1|       4694|
|     aa|County_Laois/en/Q...|           1|       4752|
|     aa|    Dassault_rafaele|           2|       9372|
|     aa|Dyskusja_wikiproj...|           1|       4824|
|     aa|              E.Desv|           1|       4662|
|     aa|Enclos-apier/fr/E...|           1|       4772|
|     aa|File:Wiktionary-l...|           1|      10752|
|     aa|Henri_de_Sourdis/...|           1|       4748|
|     aa|Incentive_Softwar...|           1|       4777|
|     aa|Indonesian_Wikipedia|           1|       4679|
|     aa|           Main_Page|           5|     266946|
+-------+--------------------+------------+-----------+

我试过了,但没用:

dataframe.select("title").map(word => (word,1)).reduceByKey(_+_);

看来我应该先将dataframe转入list,再用map函数生成key-value pair(word,1),最后汇总key value。 我是一种将数据帧从 Whosebug 传输到列表的方法, 例如

val text =dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()

但是出现错误

scala> val text = dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
2018-04-08 21:49:35 WARN  NettyRpcEnv:66 - Ignored message: HeartbeatResponse(false)
2018-04-08 21:49:35 WARN  Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
    at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply$mcV$sp(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$$anonfun$run.apply(Executor.scala:814)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
    at org.apache.spark.executor.Executor$$anon.run(Executor.scala:814)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
java.lang.OutOfMemoryError: Java heap space
  at org.apache.spark.sql.execution.SparkPlan$$anon.next(SparkPlan.scala:280)
  at org.apache.spark.sql.execution.SparkPlan$$anon.next(SparkPlan.scala:276)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at org.apache.spark.sql.execution.SparkPlan$$anon.foreach(SparkPlan.scala:276)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect.apply(SparkPlan.scala:298)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect.apply(SparkPlan.scala:297)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
  at org.apache.spark.sql.Dataset$$anonfun$collect.apply(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun$collect.apply(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun.apply(Dataset.scala:3253)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
  ... 16 elided

scala> val text = dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
java.lang.OutOfMemoryError: GC overhead limit exceeded                          
  at org.apache.spark.sql.execution.SparkPlan$$anon.next(SparkPlan.scala:280)
  at org.apache.spark.sql.execution.SparkPlan$$anon.next(SparkPlan.scala:276)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at org.apache.spark.sql.execution.SparkPlan$$anon.foreach(SparkPlan.scala:276)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect.apply(SparkPlan.scala:298)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect.apply(SparkPlan.scala:297)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
  at org.apache.spark.sql.Dataset$$anonfun$collect.apply(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun$collect.apply(Dataset.scala:2722)
  at org.apache.spark.sql.Dataset$$anonfun.apply(Dataset.scala:3253)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
  ... 16 elided

Collect-ing 你的 DataFrame 到 Scala 集合会对你的数据集大小施加限制。相反,您可以将 DataFrame 转换为 RDD,然后应用 mapreduceByKey,如下所示:

val df = Seq(
  ("aa", "271_a.C", 1, 4675),
  ("aa", "271_a.C", 1, 4400),
  ("aa", "271_a.C", 1, 4600),
  ("aa", "Chiron_Elias_Krase", 1, 4694),
  ("aa", "Chiron_Elias_Krase", 1, 4500)
).toDF("project", "title", "request_num", "return_size")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val rdd = df.rdd.
  map{ case Row(_, title: String, _, _) => (title, 1) }.
  reduceByKey(_ + _)

rdd.collect
// res1: Array[(String, Int)] = Array((Chiron_Elias_Krase,2), (271_a.C,3))

您也可以直接使用 groupBy:

转换您的 DataFrame
df.groupBy($"title").agg(count($"title").as("count")).
  show
// +------------------+-----+
// |             title|count|
// +------------------+-----+
// |           271_a.C|    3|
// |Chiron_Elias_Krase|    2|
// +------------------+-----+