py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement
I'm trying to perform a simple task in a Spark dataframe (Python): create a new dataframe by selecting specific columns and nested columns from another dataframe.
For example:
df.printSchema()
root
|-- time_stamp: long (nullable = true)
|-- country: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- id: long (nullable = true)
| |-- time_zone: string (nullable = true)
|-- event_name: string (nullable = true)
|-- order: struct (nullable = true)
| |-- created_at: string (nullable = true)
| |-- creation_type: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| |-- destination: struct (nullable = true)
| | |-- state: string (nullable = true)
| |-- ordering_user: struct (nullable = true)
| | |-- cancellation_score: long (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_test: boolean (nullable = true)
df2=df.sqlContext.sql("""select a.country_code as country_code,
a.order_destination_state as order_destination_state,
a.order_ordering_user_id as order_ordering_user_id,
a.order_ordering_user_is_test as order_ordering_user_is_test,
a.time_stamp as time_stamp
from
(select
flat_order_creation.order.destination.state as order_destination_state,
flat_order_creation.order.ordering_user.id as order_ordering_user_id,
flat_order_creation.order.ordering_user.is_test as order_ordering_user_is_test,
flat_order_creation.time_stamp as time_stamp
from flat_order_creation) a""")
I get the following error:
Traceback (most recent call last):
File "/home/hadoop/scripts/orders_all.py", line 180, in <module>
df2=sqlContext.sql(q)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.sql.
: java.lang.RuntimeException: [6.21] failure: ``*'' expected but `order' found
flat_order_creation.order.destination.state as order_destination_state,
I run this code with spark-submit with master in local mode.
It's important to mention that when I connect to the pyspark shell and run the code (line by line), it works, but when submitting it (even in local mode) it fails.
Another thing worth mentioning: selecting a non-nested field works as well.
I'm using Spark 1.5.2 on EMR (release 4.2.0).
Without a Minimal, Complete, and Verifiable example I can only guess, but it looks like you're using different SparkContext implementations in the interactive shell and in your standalone program.
As long as the Spark binaries have been built with Hive support, the sqlContext provided in the shell is a HiveContext. Among other differences, it provides a more sophisticated SQL parser than a plain SQLContext. You can easily reproduce your problem as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val conf: SparkConf = ???
val sc: SparkContext = ???
val query = "SELECT df.foobar.order FROM df"
val hiveContext: SQLContext = new HiveContext(sc)
val sqlContext: SQLContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"foobar": {"order": 1}}"""))
sqlContext.read.json(json).registerTempTable("df")
sqlContext.sql(query).show
// java.lang.RuntimeException: [1.18] failure: ``*'' expected but `order' found
// ...
hiveContext.read.json(json).registerTempTable("df")
hiveContext.sql(query)
// org.apache.spark.sql.DataFrame = [order: bigint]
Initializing sqlContext with a HiveContext in your standalone program should do the trick:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame(...)
df.registerTempTable("flat_order_creation")
sqlContext.sql(...)
It's important to note that the problem is not the nesting itself, but the use of the ORDER keyword as a column name. So if using a HiveContext is not an option, simply change the field name to something else.