How to create a parquet table in Hive 3.1 through Spark 2.3 (pyspark)
Facing issues when creating/loading a parquet table from Spark.
Environment details:
Hortonworks HDP 3.0
Spark 2.3.1
Hive 3.1
1#. When trying to create a parquet table in Hive 3.1 through Spark 2.3, Spark throws the error below.
df.write.format("parquet").mode("overwrite").saveAsTable("database_name.test1")
pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table datamart.test1 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);'
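The check being violated is Hive 3's strict managed table rule: on HDP 3.0 every managed Hive table must be transactional (ACID), which Spark 2.3's saveAsTable cannot produce. One commonly cited workaround, sketched below, is to supply an explicit path so that Spark creates an external table instead, which is exempt from the check (the location shown is illustrative, not from the original post):

# Sketch of a common workaround (not the author's code): giving saveAsTable
# an explicit path makes it create an EXTERNAL table, which is not subject
# to Hive 3's managed-must-be-transactional check.
df.write.format("parquet") \
    .mode("overwrite") \
    .option("path", "/warehouse/tablespace/external/hive/database_name.db/test1") \
    .saveAsTable("database_name.test1")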
2#. Inserting data into an existing parquet table and retrieving it through Spark works fine.
df.write.format("parquet").mode("overwrite").insertInto("database_name.test2")
spark.sql("select * from database_name.test2").show()
spark.read.parquet("/path-to-table-dir/part-00000.snappy.parquet").show()
But when I try to read the same table through Hive, the Hive session gets disconnected and throws the error below.
SELECT * FROM database_name.test2
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376)
at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453)
at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435)
at org.apache.thrift.transport.TSaslClientTransport.read(TSaslClientTransport.java:37)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
at org.apache.hive.service.rpc.thrift.TCLIService$Client.recv_FetchResults(TCLIService.java:567)
at org.apache.hive.service.rpc.thrift.TCLIService$Client.FetchResults(TCLIService.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.jdbc.HiveConnection$SynchronizedHandler.invoke(HiveConnection.java:1572)
at com.sun.proxy.$Proxy22.FetchResults(Unknown Source)
at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:373)
at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:56)
at org.apache.hive.beeline.IncrementalRowsWithNormalization.<init>(IncrementalRowsWithNormalization.java:50)
at org.apache.hive.beeline.BeeLine.print(BeeLine.java:2250)
at org.apache.hive.beeline.Commands.executeInternal(Commands.java:1026)
at org.apache.hive.beeline.Commands.execute(Commands.java:1201)
at org.apache.hive.beeline.Commands.sql(Commands.java:1130)
at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1425)
at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1287)
at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1071)
at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:538)
at org.apache.hive.beeline.BeeLine.main(BeeLine.java:520)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
Unknown HS2 problem when communicating with Thrift server.
Error: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (Write failed) (state=08S01,code=0)
After this error occurs, the Hive session is disconnected and I have to reconnect. All other queries work fine; only this one throws the error above and drops the connection.
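For reference, the definition of the table involved can be inspected from the Spark side (a diagnostic sketch using the table name from the example above; the transactional entry under Table Properties is the relevant field):

# Inspect the table metadata; on HDP 3.0 a Hive-managed table is expected
# to show transactional=true among its table properties.
spark.sql("DESCRIBE FORMATTED database_name.test2").show(100, truncate=False)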
This issue occurs because the Hive table was accessed without the Hive Warehouse Connector.
By default, Spark uses its own Spark catalog; the article below explains how to access Apache Hive tables from Spark:
Integrating Apache Hive with Apache Spark - Hive Warehouse Connector
From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separated, and each uses its own catalog; namely, they are mutually exclusive: the Apache Hive catalog can only be accessed by Apache Hive or this library, and the Apache Spark catalog can only be accessed by existing APIs in Apache Spark. In other words, some features such as ACID tables or Apache Ranger with Apache Hive tables are only available via this library in Apache Spark. Those tables in Hive should not be directly accessed within the Apache Spark APIs themselves.
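A minimal pyspark sketch of going through the Hive Warehouse Connector follows. It assumes an HDP 3.0 cluster with Hive LLAP enabled (which executeQuery requires), the HWC assembly jar and pyspark_hwc zip shipped with the job (e.g. via --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar and --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<version>.zip), and cluster-specific settings such as spark.sql.hive.hiveserver2.jdbc.url already configured; the table names reuse the examples above:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = SparkSession.builder.appName("hwc-example").getOrCreate()

# Build an HWC session; reads and writes below go through HiveServer2,
# so Hive-managed (ACID) tables are handled correctly.
hive = HiveWarehouseSession.session(spark).build()

# Read a Hive table through the connector instead of spark.sql(...)
df = hive.executeQuery("SELECT * FROM database_name.test2")
df.show()

# Write a DataFrame to a Hive table through the connector instead of
# saveAsTable(); the string below is the connector's data source name.
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .mode("append") \
    .option("table", "database_name.test1") \
    .save()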