How to avoid Hive Staging Area Write on Cloud

Question

I frequently have to write DataFrames as Hive tables.

df.write.mode('overwrite').format('hive').saveAsTable(f'db.{file_nm}_PT')

Or I use Spark SQL or Hive SQL to copy one table to another as a backup.

INSERT OVERWRITE TABLE db.tbl_bkp PARTITION (op_cd, rpt_dt)
SELECT * FROM db.tbl;

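The problem is that writing to the Hive staging directory takes about 25% of the total time, while 75% or more goes into moving the ORC files from the staging directory to the final partitioned directory structure.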

21/11/13 00:51:25 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2019-10-24 with partSpec {rpt_dt=2019-10-24}
21/11/13 00:51:56 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2018-02-18/part-00058-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2018-02-18/part-00058-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true
21/11/13 00:51:56 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2018-02-18 with partSpec {rpt_dt=2018-02-18}
21/11/13 00:52:31 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2019-01-29/part-00046-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2019-01-29/part-00046-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true
21/11/13 00:52:31 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2019-01-29 with partSpec {rpt_dt=2019-01-29}
21/11/13 00:53:09 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2020-08-01/part-00020-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2020-08-01/part-00020-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true
21/11/13 00:53:09 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2020-08-01 with partSpec {rpt_dt=2020-08-01}
21/11/13 00:53:46 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2021-07-12/part-00026-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2021-07-12/part-00026-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true
21/11/13 00:53:46 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2021-07-12 with partSpec {rpt_dt=2021-07-12}
21/11/13 00:54:17 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2022-01-21/part-00062-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2022-01-21/part-00062-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true
21/11/13 00:54:17 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2022-01-21 with partSpec {rpt_dt=2022-01-21}
21/11/13 00:54:49 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2018-01-20/part-00063-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2018-01-20/part-00063-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true
21/11/13 00:54:49 INFO hive.ql.metadata.Hive: New loading path = gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2018-01-20 with partSpec {rpt_dt=2018-01-20}
21/11/13 00:55:22 INFO hive.ql.metadata.Hive: Replacing src:gs://sam_tables/teradata/tbl_bkp/.hive-staging_hive_2021-11-12_23-26-38_441_6664318328991520567-1/-ext-10000/rpt_dt=2019-09-01/part-00037-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, dest: gs://sam_tables/teradata/tbl_bkp/rpt_dt=2019-09-01/part-00037-95ee77f0-4e27-4765-a454-e5009c4f33f3.c000, Status:true

This operation is quite fast on an actual HDFS, but on Google Cloud Storage the rename is really a copy-and-paste of blobs, and it is very slow.
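
To make the cost concrete, here is a minimal sketch of why a "directory rename" gets expensive on a flat object store. `ToyObjectStore` and its methods are hypothetical stand-ins written for this illustration; they are not the GCS client API, and the key names only mimic the layout seen in the log above.

```python
from typing import Dict


class ToyObjectStore:
    """In-memory stand-in for a flat object store (hypothetical, not the GCS API)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.bytes_copied = 0  # total payload rewritten by "renames"

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def rename_prefix(self, src: str, dst: str) -> int:
        """'Rename a directory': copy every object under src to dst, then delete it.

        Returns the number of objects that had to be copied.
        """
        keys = [k for k in self.objects if k.startswith(src)]
        for key in keys:
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data  # copy to the new key
            self.bytes_copied += len(data)
            del self.objects[key]  # delete the source object
        return len(keys)


store = ToyObjectStore()
staging = "tbl_bkp/.hive-staging/-ext-10000/"  # shortened, illustrative prefix
for day in ("2018-02-18", "2019-01-29", "2020-08-01"):
    store.put(f"{staging}rpt_dt={day}/part-00000", b"x" * 1024)

moved = store.rename_prefix(staging, "tbl_bkp/")
print(moved, store.bytes_copied)  # 3 3072
```

The work grows with the number and size of the files, whereas on HDFS a directory rename is a metadata-only change.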

I have heard about direct path writes. Can anyone suggest how to do that?

Answer 1

(Not so) short answer

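It's... complicated. Very complicated. I wanted to write a short answer, but I would risk being misleading on several points. Instead, I will try to give a very short summary of the very long answer.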

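Hive uses a staging directory for a good reason: atomicity. You don't want users to read a table while it is being rewritten, so you write into a staging directory and rename that directory once you are done.
The problem is that cloud storage services are "object stores", not "distributed file systems" like HDFS, and some operations, such as renaming a folder, can be much slower because of that.
Each cloud has its own storage implementation, with its own specificities and downsides, and over time they even introduce new variants to overcome some of those downsides (for instance, Azure has three storage variants: Blob Storage, Data Lake Storage Gen1, and Gen2).
Therefore, the best solution on one cloud is not necessarily the best solution on another.
The file system API implementations for the various cloud storage services are part of the Hadoop distribution that Spark uses. So the solutions available to you also depend on the Hadoop version your Spark installation uses.
Azure/GCS only: you can try setting [this option](https://spark.apache.org/docs/3.1.1/cloud-integration.html#configuring): spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2. It is faster than v1, but it is also not recommended, because it is not atomic and is therefore less safe in case of partial failures.
v2 is currently the default in Hadoop, but Spark 3 set it back to v1 by default, and there is some discussion in the Hadoop community about deprecating it and making v1 the default again.
There is also some ongoing development to write better output committers for Azure and GCS, based on a similar output committer done for S3.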
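
As a rough sketch of how the committer option mentioned above could be passed to a job: the helpers `to_submit_args` and `apply_to_builder` and the script name `job.py` are made up for this example. The only thing taken from the answer is the option key and value, and the same caveat applies (v2 is not atomic).

```python
from typing import Dict, List

# The option named in the answer. The "spark.hadoop." prefix is how Spark
# forwards a setting to the underlying Hadoop configuration.
COMMITTER_V2: Dict[str, str] = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render settings as `spark-submit --conf key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def apply_to_builder(builder, conf: Dict[str, str]):
    """Apply settings to a SparkSession.builder-like object via .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# job.py is a hypothetical script name used only for illustration.
print(" ".join(["spark-submit"] + to_submit_args(COMMITTER_V2) + ["job.py"]))
```

With PySpark installed, the same dictionary could be applied in code with `apply_to_builder(SparkSession.builder, COMMITTER_V2).getOrCreate()`. Whether it actually shortens the move out of the staging directory depends on the write path and the Hadoop version in use, so it is worth measuring on a small table first.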

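Alternatively, you can try switching to a cloud-first format such as Apache Iceberg, Apache Hudi, or Delta Lake.
I am not very familiar with these yet, but a quick look at Delta Lake's documentation convinced me that they had to deal with the same kind of issues (cloud storage not being a real file system), and depending on which cloud you are on, it may require extra configuration, especially on GCP, where the feature is flagged as experimental.
Edit: Apache Iceberg does not have this problem, because it uses metadata files to point to the real data files' locations. As a result, changes to a table are committed via an atomic change on a single metadata file.
I am not very familiar with Apache Hudi, and I could not find any mention of how they deal with this kind of issue. I would have to dig further into their design architecture to know for sure.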
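
The metadata-pointer idea can be sketched in a few lines. This is a toy model on a local file system, not Iceberg's file format or API: the file name `current.json`, the JSON layout, and the functions `commit` and `read_table` are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import List


def commit(table_dir: str, data_files: List[str]) -> None:
    """Publish a new table version by atomically replacing one metadata file."""
    current = os.path.join(table_dir, "current.json")  # hypothetical name
    fd, tmp_path = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"data_files": data_files}, tmp)
    # os.replace swaps the file in one step on a local file system, so a
    # reader sees either the old file list or the new one, never a mix.
    os.replace(tmp_path, current)


def read_table(table_dir: str) -> List[str]:
    """Return the data files referenced by the current metadata file."""
    with open(os.path.join(table_dir, "current.json")) as f:
        return json.load(f)["data_files"]


table_dir = tempfile.mkdtemp()
commit(table_dir, ["data/a.orc"])
# New data files are written at their final location first; only the small
# metadata file changes at commit time, so no bulk rename is needed.
commit(table_dir, ["data/b.orc", "data/c.orc"])
print(read_table(table_dir))  # ['data/b.orc', 'data/c.orc']
```

The point of the design is that the expensive part (writing data files) never has to be moved afterwards, and the visible switch is a single small write.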

Now, for the long answer, maybe I should write a blog post... I will post it here once it is written.
