Does Hive ORC ACID on Hive 3 require TEZ if not using Map Reduce?

My understanding was/is that on Hive 3, a HIVE ORC ACID table using MERGE also requires at least TEZ as the underlying execution engine, if not using Map Reduce or Hive's Spark engine. In fact, I do not believe HIVE MERGE, update, and delete work with the Spark engine at all.
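To make the setup concrete, here is a minimal HiveQL sketch of the kind of statement I mean, as it would be run from beeline. The table and column names are hypothetical, and on Hive 3 the ACID settings shown are typically already the defaults:

```sql
-- Pin the engine under test; on Hive 3 distributions TEZ is normally the default anyway.
SET hive.execution.engine=tez;

-- ACID prerequisites (typically already set on Hive 3):
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical full-ACID target; managed ORC tables are transactional by default on Hive 3.
CREATE TABLE target (id INT, val STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Hypothetical staging table carrying the changes to apply.
CREATE TABLE source (id INT, val STRING, op STRING) STORED AS ORC;

-- The MERGE whose engine requirement is the question.
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
```

The open question is whether a MERGE like this can execute with anything other than TEZ (or MR) as hive.execution.engine.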

But I have not been able to confirm this from the documentation and the various release updates, hence this post. It seems hard to find a coherent body of prose on this subject, and I am away from a cluster.

Moreover, the italicized and bolded statement from https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-version-release about full transactional functionality is something I could not follow, as I was not aware that SPARK could update and delete on HIVE ORC ACID (yet):

Apache Spark

Apache Spark gets updatable tables and ACID transactions with Hive Warehouse Connector. Hive Warehouse Connector allows you to register Hive transactional tables as external tables in Spark to access ***full transactional functionality***. Previous versions only supported table partition manipulation. Hive Warehouse Connector also supports Streaming DataFrames for streaming reads and writes into transactional and streaming Hive tables from Spark.

Spark executors can connect directly to Hive LLAP daemons to retrieve and update data in a transactional manner, allowing Hive to keep control of the data.

Apache Spark on HDInsight 4.0 supports the following scenarios:

- Run machine learning model training over the same transactional table used for reporting.
- Use ACID transactions to safely add columns from Spark ML to a Hive table.
- Run a Spark streaming job on the change feed from a Hive streaming table.
- Create ORC files directly from a Spark Structured Streaming job.

You no longer have to worry about accidentally trying to access Hive transactional tables directly from Spark, resulting in inconsistent results, duplicate data, or data corruption. In HDInsight 4.0, Spark tables and Hive tables are kept in separate Metastores. Use Hive Data Warehouse Connector to explicitly register Hive transactional tables as Spark external tables.

The statement in bold italics above is not correct.

https://issues.apache.org/jira/browse/SPARK-15348 clearly indicates that Spark does not support HIVE ORC ACID processing.
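For context, the kind of table at issue is sketched below (hypothetical names). Row-level UPDATE and DELETE on it work from Hive itself; it is direct access to such a table from Spark that the JIRA above tracks as unsupported:

```sql
-- A Hive 3 full-ACID table: ORC is required for full (UPDATE/DELETE) transactional support.
CREATE TABLE acid_tbl (id INT, val STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level DML runs fine from Hive (on TEZ or MR) ...
UPDATE acid_tbl SET val = 'changed' WHERE id = 1;
DELETE FROM acid_tbl WHERE id = 2;

-- ... but a plain Spark SQL read of acid_tbl, without the Hive Warehouse Connector,
-- is what SPARK-15348 describes as unsupported.
```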

MR is disappearing on the various cloud platforms, and TEZ is now the default engine, so both sqoop and Hive ORC ACID use it; hence at least TEZ is required.

NOTE: I am only asking this now because of my last assignment; this discussion came from 'upstairs'.