Spark not able to write into a new Hive table in partitioned and append mode
- Created a new table in Hive in partitioned and ORC format.
- Writing into this table using Spark with append mode, orc format, and partitioning.
It fails with the exception:
org.apache.spark.sql.AnalysisException: The format of the existing table test.table1 is `HiveFileFormat`. It doesn't match the specified format `OrcFileFormat`.;
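For reference, the table was created directly in Hive with DDL along these lines (a sketch; the column names are taken from the snippets in the edits below, and on HDP 3.x a managed ORC table like this is created as transactional/ACID by default):

```sql
-- Sketch of the assumed DDL: a managed, partitioned ORC table.
-- On HDP 3.x / Hive 3, a managed table like this is transactional (ACID)
-- by default unless declared external or explicitly non-transactional.
CREATE TABLE test.t2 (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (code STRING)
STORED AS ORC;
```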
- I changed the format from "orc" to "hive" while writing. It still fails, but with a different exception:
Spark is not able to understand the underlying structure of the table.
So this issue occurs because Spark is not able to write into a Hive table in append mode, since it cannot create a new table. I can do an overwrite successfully because Spark creates the table again. But my use case is to write in append mode from the beginning. insertInto also does not work, specifically for partitioned tables. My use case is almost blocked. Any help would be great.
Edit 1:
Working in an HDP 3.1.0 environment.
Spark version is 2.3.2
Hive version is 3.1.0
Edit 2:
// Reading the table
val inputdf = spark.sql("select id, code, amount from t1")
// Writing into the table
inputdf.write.mode(SaveMode.Append).partitionBy("code").format("orc").saveAsTable("test.t2")
Edit 3: Using insertInto()
val df2 = spark.sql("select id, code, amount from t1")
df2.write.format("orc").mode("append").insertInto("test.t2")
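Note: insertInto into a partitioned Hive table uses dynamic partitioning, which may need to be enabled depending on cluster defaults (a hedged aside; this is not necessarily the cause of the error below, but is worth ruling out):

```scala
// Assumption: these dynamic-partition settings may be required for
// insertInto on a partitioned Hive table; defaults differ per cluster.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
df2.write.format("orc").mode("append").insertInto("test.t2")
```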
The error I get is:
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN AcidUtils: Cannot get ACID state for test.t1 from null
20/05/17 19:15:13 WARN HiveMetastoreCatalog: Unable to infer schema for table test.t1 from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.
If I rerun the insertInto command, I get the following exception:
20/05/17 19:16:37 ERROR Hive: MetaException(message:The transaction for alter partition did not commit successfully.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$alter_partitions_req_result$alter_partitions_req_resultStandardScheme.read(ThriftHiveMetastore.java)
Error in the Hive metastore logs:
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(907)) - 163: alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(349)) - ugi=X@A.ORG ip=10.10.1.36 cmd=alter_partitions : tbl=hive.test.t1
2020-05-17T21:17:43,891 INFO [pool-8-thread-198]: metastore.HiveMetaStore (HiveMetaStore.java:alter_partitions_with_environment_context(5119)) - New partition values:[BR]
2020-05-17T21:17:43,913 ERROR [pool-8-thread-198]: metastore.ObjectStore (ObjectStore.java:alterPartitions(4397)) - Alter failed
org.apache.hadoop.hive.metastore.api.MetaException: Cannot change stats state for a transactional table without providing the transactional write state for verification (new write ID -1, valid write IDs null; current state null; new state {}
I was able to resolve the problem by using external tables for my use case. There is currently an open issue in Spark related to Hive's ACID properties. Once I created the Hive table as external, I could perform appends into both partitioned and non-partitioned tables.
https://issues.apache.org/jira/browse/SPARK-15348
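A minimal sketch of the workaround, assuming the table is recreated as external (the LOCATION path is hypothetical; on Hive 3 an external table is not transactional, which is why Spark can append to it):

```sql
-- Hypothetical: drop the managed (ACID) table and recreate it as EXTERNAL.
-- External tables are non-transactional, so Spark's append path works.
CREATE EXTERNAL TABLE test.t2 (
  id INT,
  amount DOUBLE
)
PARTITIONED BY (code STRING)
STORED AS ORC
LOCATION '/warehouse/tablespace/external/hive/test.db/t2';
```

After this, the append writes shown in Edit 2 and Edit 3 succeed for both partitioned and non-partitioned tables.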