Hive: read table partitions defined in subselect

Question

I have a Hive table partitioned by the partitionDate field. I can read a partition of my choice with a simple query:

select * from myTable where partitionDate = '2000-01-01'

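My task is to specify the partition of my choice dynamically, i.e. I first want to read it from some other table and only then run the select against myTable. Of course, I still want to benefit from partitioning.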

I wrote a query that looks like this:

select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate

The query works, but partitioning does not seem to be used: the query runs for far too long.

I tried another approach:

select * from myTable where partitionDate in (select reportDate from thatTable)

...and again I found that the query runs too slowly.

Is there a way to achieve this in Hive?

Update: the create table statement for myTable:

CREATE TABLE `myTable`(            
  `theDate` string,            
 ')            
PARTITIONED BY (           
  `partitionDate` string) 
TBLPROPERTIES (             
  'DO_NOT_UPDATE_STATS'='true',         
  'STATS_GENERATED_VIA_STATS_TASK'='true',                
  'spark.sql.create.version'='2.2 or prior',              
  'spark.sql.sources.schema.numPartCols'='1',    
  'spark.sql.sources.schema.numParts'='2',          
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...         
  'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',               
  'spark.sql.sources.schema.partCol.0'='partitionDate')

Answer 1

If you are running Hive on the Tez execution engine, try:

set hive.tez.dynamic.partition.pruning=true;

See Jira HIVE-7826 for more details and the related configuration.
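As a rough sketch of what that related configuration can look like, the session settings below enable dynamic partition pruning and set its two size limits. The property names and the values shown are what I understand the defaults to be; they are assumptions to verify against the configuration reference for your Hive version, not something stated in the question.

```sql
-- Assumption: the session runs on Tez; dynamic partition pruning
-- as described in HIVE-7826 does not apply to the MapReduce engine.
set hive.execution.engine=tez;

-- Let Tez send the join keys (reportDate values) to the scan of myTable
-- so that only matching partitionDate partitions are read.
set hive.tez.dynamic.partition.pruning=true;

-- Size limits for the pruning events (values believed to be the defaults).
-- If thatTable yields more distinct dates than fit, pruning may be skipped.
set hive.tez.dynamic.partition.pruning.max.event.size=1048576;
set hive.tez.dynamic.partition.pruning.max.data.size=104857600;
```

These are session-level settings, so they need to be issued in the same session as the query itself.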

Also try rewriting the query as a LEFT SEMI JOIN:

select * 
  from myTable t 
       left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate
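To check whether a rewrite actually changes how the query is planned, the plan can be inspected before running it. This is only a sketch: the table and column names are the ones from the question, and the exact plan text differs between Hive versions. With dynamic partition pruning, partitions are eliminated at run time, so the thing to look for in the plan is a pruning-related operator feeding the scan of myTable rather than a static list of partitions.

```sql
-- Show the execution plan for the semi-join variant without running it.
explain
select *
  from myTable t
       left semi join (select distinct reportDate from thatTable) s
         on t.partitionDate = s.reportDate;

-- For comparison: with a literal filter, pruning happens at compile time,
-- and EXPLAIN DEPENDENCY lists the partitions the query will read.
explain dependency
select * from myTable where partitionDate = '2000-01-01';
```

If the first plan shows no sign of pruning on partitionDate, the query is most likely still scanning every partition of myTable.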

If nothing helps, see this workaround:

Or this one:

Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

sql

hive

query-optimization

partition

hive-partitions