Convert an RDD to a Spark DataFrame (PySpark). This worked, but now it gives a new error
I have an RDD:
rd.take(2)
[Row(id=0, items=['ab', 'nccd'], actor='brad'),
Row(id=1, items=['rd', 'fh'], actor='tony')]
I am trying to convert it to a Spark DataFrame:
df = spark.createDataFrame(rd)
This worked for me.
But now when I try to run:
df.show()
it gives me an error. It was working before. Please give me some insight.
Error:
Py4JJavaError: An error occurred while calling o1264.showString.
: java.lang.IllegalStateException: SparkContext has been shutdown
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2021)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
As you may know, Apache Spark is lazily evaluated. You chain together transformations, and they are only executed when an action is called. So when you call show() or collect(), everything you queued up before is processed at that point. That is why your createDataFrame call only appeared to succeed: nothing actually ran until df.show(), and by then the SparkContext had been shut down.
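To see this concretely, here is a minimal sketch of lazy evaluation (the variable names are illustrative, not from the question):

rdd = spark.sparkContext.parallelize(range(10))  # no job runs yet
doubled = rdd.map(lambda x: x * 2)               # transformation: still no job
doubled.collect()                                # action: only now does Spark execute,
                                                 # so a shut-down SparkContext fails here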
Please read this post; it will give you an idea of how to achieve the desired output:
On top of what @pissall said, the following should work:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# Declare the schema explicitly instead of letting Spark infer it
schema = StructType([
    StructField('id', IntegerType()),
    StructField('items', ArrayType(StringType())),
    StructField('actor', StringType()),
])

df = spark.createDataFrame(rd, schema)
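Note that the stack trace says the SparkContext itself has been shut down, so the schema alone will not help until there is a live session again. A minimal sketch, assuming you can simply rebuild the session and the RDD in the same process (the reconstruction of rd below is hypothetical, mirroring the question's data):

from pyspark.sql import Row, SparkSession

# getOrCreate() returns the active session, or starts a new one if the
# previous context was stopped
spark = SparkSession.builder.getOrCreate()

# Hypothetical re-creation of the RDD from the question, for a self-contained run
rd = spark.sparkContext.parallelize([
    Row(id=0, items=['ab', 'nccd'], actor='brad'),
    Row(id=1, items=['rd', 'fh'], actor='tony'),
])

df = spark.createDataFrame(rd, schema)
df.show()  # the action now executes because the context is alive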