Empty Spark dataset reading Hive table

Question

There are two Hive tables created from the same source with the same logic, but with slightly different queries:

The query for table 1 is:

create table test.table1 stored as orc as
    select
        f1,
        mc.f2 as f2,
        mc.f3 as f3,
        f4
    from src.test_table lateral view explode(multiple_field) mcTable as mc
    union all
    select
        f1,
        f5 as f2,
        f6 as f3,
        f4
    from src.test_table
    where multiple_field is null or size(multiple_field) < 1
;

Next, the query for table 2. It uses the same logic, shortened with outer explode:

 create table test.table2 stored as orc as
     select
       f1,
       if(mc is null, f5, mc.f2) as f2,
       if(mc is null, f6, mc.f3) as f3,
       f4
 from src.test_table lateral view outer explode(multiple_field) mcTable as mc
;

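Both tables were created successfully and contain the same number of rows and the same data (checked with the Hive Beeline client). Then I tried to read the tables' data with Spark: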

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession
        .builder()
        .config("hive.execution.engine", "mr")
        .appName("OrcExportJob")
        .enableHiveSupport()
        .getOrCreate();

String hql = "select * from test.table1"; // or test.table2
Dataset<Row> sqlDF = sparkSession.sql(hql);

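In the case of test.table2 everything is fine: sqlDF contains all the data. Reading test.table1 gives a different result: sqlDF contains no data at all (0 rows). The Spark logs show no errors, as if the table really were empty.

I have heard that Spark has some problems reading transactional or partitioned Hive tables, but neither applies here.

Digging around, I discovered that Hive stores the ORC files for my tables in different ways: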

/
├─ user/
│  ├─ hive/
│  │  ├─ warehouse/
│  │  │  ├─ test.db/
│  │  │  │  ├─ table1/
│  │  │  │  │  ├─ 1/
│  │  │  │  │  │  ├─ 1/
│  │  │  │  │  │  │  ├─ 000000_0
│  │  │  │  │  ├─ 2/
│  │  │  │  │  │  ├─ 000000_0
│  │  │  │  │  │  ├─ 000001_0
│  │  │  │  │  │  ├─ 000002_0
│  │  │  │  │  │  ├─ 000003_0
│  │  │  │  ├─ table2/
│  │  │  │  │  ├─ 000000_0
│  │  │  │  │  ├─ 000001_0
│  │  │  │  │  ├─ 000002_0
│  │  │  │  │  ├─ 000003_0

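Can anyone help me figure out why Spark does not see the data in table 1?

Why does Hive keep 5 files in a nested directory structure for table1, but 4 files in a flat structure for table2?

Can this somehow affect how Spark reads the data?

P.S. The Hive version is 2.3.3 and the Spark version is 2.4.4.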

Answer 1

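Normally the data files sit directly in the table location, without subdirectories.

The UNION ALL is being optimized (most likely you are using Tez): each query in the union runs in parallel, independently, as separate mapper tasks. This requires a separate subdirectory for each query in the UNION ALL so that the results can be written simultaneously, which is why you have two directories.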
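To make the effect of the subdirectories concrete, here is a minimal sketch in plain Java. It does not use Hive or Spark: it only recreates the two layouts from the question as empty placeholder files in a local temp directory and compares a flat listing with a recursive one. The class and method names (SubdirListing, countFlat, countRecursive, createLayout, demoCounts) are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class SubdirListing {

    // Counts regular files located directly in dir (what a non-recursive reader sees).
    static long countFlat(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts regular files at any depth below dir (what a recursive reader sees).
    static long countRecursive(Path dir) {
        try (Stream<Path> entries = Files.walk(dir)) {
            return entries.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates empty placeholder files under tableDir, including parent directories.
    static Path createLayout(Path tableDir, String... relativeFiles) {
        try {
            for (String rel : relativeFiles) {
                Path file = tableDir.resolve(rel);
                Files.createDirectories(file.getParent());
                Files.createFile(file);
            }
            return tableDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {table1 flat, table1 recursive, table2 flat, table2 recursive}.
    static long[] demoCounts() {
        Path warehouse;
        try {
            warehouse = Files.createTempDirectory("warehouse");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Layout copied from the directory tree in the question.
        Path table1 = createLayout(warehouse.resolve("table1"),
                "1/1/000000_0",
                "2/000000_0", "2/000001_0", "2/000002_0", "2/000003_0");
        Path table2 = createLayout(warehouse.resolve("table2"),
                "000000_0", "000001_0", "000002_0", "000003_0");
        return new long[] {
            countFlat(table1), countRecursive(table1),
            countFlat(table2), countRecursive(table2)
        };
    }

    public static void main(String[] args) {
        long[] c = demoCounts();
        System.out.println("table1: flat=" + c[0] + ", recursive=" + c[1]); // flat=0, recursive=5
        System.out.println("table2: flat=" + c[2] + ", recursive=" + c[3]); // flat=4, recursive=4
    }
}
```

A reader that only looks at files directly under the table location finds nothing for table1, which would match the 0 rows without any error.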

These settings allow Hive to read subdirectories:

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;

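There is an issue, SPARK-26663 (Cannot query a Hive table with subdirectories), which was closed as "cannot reproduce" because the reproduction steps were run on MR rather than on Tez.

If you need to read such tables, try using HiveContext and setting the properties above.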
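On the Spark side this could look roughly like the sketch below. In Spark 2.x, HiveContext is deprecated in favor of SparkSession with enableHiveSupport(), so the sketch uses the latter. The first two properties are the ones listed above. The last two (the Hadoop recursive-input property passed through the spark.hadoop. prefix, and disabling spark.sql.hive.convertMetastoreOrc so that the Hive SerDe reader is used) are my assumptions and not part of the original suggestion. The class name is made up, and I have not verified this against Hive 2.3.3 with Spark 2.4.4.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadTableWithSubdirs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("OrcExportJob")
                // The two properties suggested above.
                .config("hive.input.dir.recursive", "true")
                .config("hive.mapred.supports.subdirectories", "true")
                // Assumption: let the Hadoop input format descend into subdirectories.
                .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
                // Assumption: read through the Hive SerDe instead of Spark's built-in ORC reader.
                .config("spark.sql.hive.convertMetastoreOrc", "false")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> sqlDF = spark.sql("select * from test.table1");
        // If the settings take effect, this should match the row count seen in Beeline.
        System.out.println("rows: " + sqlDF.count());

        spark.stop();
    }
}
```

If the count is still 0, recreating the table without subdirectories is probably the more reliable route.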

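By the way, your second query is more efficient, because it reads the source table only once and does not create subdirectories.

You can also try running your CREATE TABLE on MR, which will not create subdirectories (set hive.execution.engine=mr;).

Also, wrapping the UNION ALL in a subquery and adding something like DISTRIBUTE BY or ORDER BY will force an additional reduce step.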
