Hive query fails with file not found error that should not be possible

I have an external Hive table that is partitioned by the date and time of insert, e.g. 20200331_0505, in the format YYYYMMDD_HHMM.

Currently there is only one partition:

> hdfs dfs -ls /path/to/external/table    
-rw-r----- 2020-03-31 05:06 /path/to/external/table/_SUCCESS  
drwxr-x--- 2020-03-31 05:06 /path/to/external/table/loaddate=20200331_0505  

If I run a Hive query to list the partitions:

select distinct loaddate from table;  
+----------------+
|    loaddate    |
+----------------+
| 20200331_0505  |
+----------------+

This is expected, and it is what I want to see. But if I run this:

select * from table where loaddate=(select max(loaddate) from table);

Then I get this error:

ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 3, vertexId=vertex_1585179445264_14095_4_00, diagnostics=[Vertex vertex_1585179445264_14095_4_00 [Map 3] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: <Table> initializer failed, vertex=vertex_1585179445264_14095_4_00 [Map 3], java.lang.RuntimeException: ORC split generation failed with exception: java.io.FileNotFoundException: File hdfs://path/to/external/table/loaddate=20200327_0513 does not exist.
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)

So it is trying to load a partition that does not exist, 20200327_0513. What could be causing this?

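When you delete a partition directly with an rm command, or overwrite it with something like a SaveMode.Overwrite write, Hive is not notified of the partition change. The metastore still lists loaddate=20200327_0513 even though its directory is gone, so split generation goes looking for it. You need to tell Hive that the partitions have changed. There are many ways to do this; the fix I chose was: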

msck repair table <table> sync partitions
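
To confirm that a stale metastore entry is the cause, and to show one of the other ways of fixing it, here is a minimal sketch. It assumes the partition column is a string named loaddate, as in the question, and it keeps <table> as a placeholder for the real table name. As far as I know the sync partitions option needs Hive 3.0 or later, so the explicit drop may be the only choice on older versions.

```sql
-- List the partitions the metastore knows about. If 20200327_0513
-- still shows up here, the metastore is out of sync with HDFS.
show partitions <table>;

-- Alternative to msck repair: drop the stale partition explicitly.
-- For an external table this normally removes only the metastore
-- entry; the directory is already gone in this case anyway.
alter table <table> drop if exists partition (loaddate='20200327_0513');

-- Check again: only loaddate=20200331_0505 should remain.
show partitions <table>;
```

After either fix, the query with the max(loaddate) subquery should only generate splits for directories that actually exist.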