Hive query fails with file not found error that should not be possible
I have an external Hive table that is partitioned by the date and time the data was inserted, e.g. 20200331_0505, in the format YYYYMMDD_HHMM.
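For context, a minimal sketch of how such an external table might be declared (ORC is inferred from the error below; the table and data column names here are illustrative, not from the original setup):
create external table my_table (
  id string,
  payload string
)
partitioned by (loaddate string)  -- e.g. loaddate=20200331_0505, format YYYYMMDD_HHMM
stored as orc
location '/path/to/external/table';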
Currently there is only one partition:
> hdfs dfs -ls /path/to/external/table
-rw-r----- 2020-03-31 05:06 /path/to/external/table/_SUCCESS
drwxr-x--- 2020-03-31 05:06 /path/to/external/table/loaddate=20200331_0505
If I run a Hive query to find the partitions:
select distinct loaddate from table;
+----------------+
| loaddate |
+----------------+
| 20200331_0505 |
+----------------+
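As a side note, the metastore's own view of the partitions can also be listed directly, for comparison with the HDFS listing above (the table name here is a placeholder):
show partitions my_table;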
That result is expected and is what I want to see, but if I run this:
select * from table where loaddate=(select max(loaddate) from table);
Then I get this error:
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 3, vertexId=vertex_1585179445264_14095_4_00, diagnostics=[Vertex vertex_1585179445264_14095_4_00 [Map 3] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: <Table> initializer failed, vertex=vertex_1585179445264_14095_4_00 [Map 3], java.lang.RuntimeException: ORC split generation failed with exception: java.io.FileNotFoundException: File hdfs://path/to/external/table/loaddate=20200327_0513 does not exist.
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
So it is trying to load a partition that does not exist, 20200327_0513. What could be causing this?
When you delete a partition directly, whether with an rm command or with a write such as SaveMode.Overwrite, Hive is not notified that the partitions have changed, so you need to tell it yourself. There are many ways to do this; the fix I went with was:
msck repair table <table> sync partitions
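A minimal sketch of what that looks like end to end, assuming a placeholder table name of my_table (the stale partition value is the one from the error above):
msck repair table my_table sync partitions;
show partitions my_table;  -- the stale loaddate=20200327_0513 entry should no longer be listed
Alternatively, the stale partition could be dropped from the metastore explicitly:
alter table my_table drop if exists partition (loaddate='20200327_0513');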