Partitions in Hive

Question

I have a dataset on a CDH cluster that is partitioned by yyyymm.

When I run the following query in Hive:

select actvydt, cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int) from pos where yyyymm=201601 and actvydt>='2016-01-01' and actvydt<='2016-01-09' limit 10;

it reads only the correct partition, 201601.

The result is as follows:

actvydt     yyyymm
2016-01-02  201601
2016-01-02  201601
2016-01-02  201601

But when I run the query below (the only change is that the yyyymm value is computed with the substr and concat functions):

select actvydt,cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int) from pos.pos_sales_weekly where yyyymm=cast(trim((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2))))) as int) and actvydt>='2016-01-01' and actvydt<='2016-01-09' limit 10;

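it scans the entire dataset, so it looks as if the yyyymm value is not being passed correctly. There seems to be some problem with this expression: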

 cast((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2)))) as int)

Yet when the same expression is selected as a column, as in the result above, it returns the correct value, 201601. Any help would be appreciated.

Below is the table schema:

CREATE EXTERNAL TABLE IF NOT EXISTS pos (
  nid bigint,
  actvydt date,
  upc string,
  tchid string,
  posfileid string,
  yssk bigint
)
PARTITIONED BY (yyyymm int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/'
TBLPROPERTIES ('avro.output.codec'='snappy');

Answer 1

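For partition pruning to work, the partition key values must be known before the query executes. You are using this WHERE clause: yyyymm=cast(trim((concat(trim(substr(ActvyDt, 1, 4)), trim(substr(ActvyDt, 6, 2))))) as int) and actvydt>='2016-01-01' and actvydt<='2016-01-09'

Here yyyymm is compared with an expression over ActvyDt, a regular column whose value is only known once each row has been read. Unfortunately, the optimizer is not smart enough to infer the yyyymm value from such a fairly complex function before executing the query. Try adding an explicit condition as well: yyyymm='201601'. That will work, and you can pass the value in as a variable.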
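The suggestion above can be sketched roughly as follows. This is only an illustration, assuming the pos table from the question and Hive's variable substitution; the variable name yyyymm and the way it is set are assumptions, and in practice the value would more likely be supplied by the caller (for example with --hivevar on the command line). The EXPLAIN DEPENDENCY statement is included as one way to check which partitions a query would read.

```sql
-- Assumption: the caller knows the month up front and supplies it as a variable.
-- The variable name "yyyymm" is hypothetical.
SET hivevar:yyyymm=201601;

-- The partition predicate is now a plain constant, so it can be resolved
-- at compile time, before any rows are read.
SELECT actvydt, yyyymm
FROM pos
WHERE yyyymm = ${hivevar:yyyymm}
  AND actvydt >= '2016-01-01'
  AND actvydt <= '2016-01-09'
LIMIT 10;

-- Lists the tables and partitions the query depends on; if pruning works,
-- only the yyyymm=201601 partition should appear in the output.
EXPLAIN DEPENDENCY
SELECT actvydt
FROM pos
WHERE yyyymm = ${hivevar:yyyymm}
  AND actvydt >= '2016-01-01'
  AND actvydt <= '2016-01-09';
```

The date filter on actvydt stays in place: it narrows the rows within the partition, while the constant yyyymm predicate is what limits the partitions scanned.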

Answer 2

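Somehow, the value 2016-01-01 gets created.

At that same point, or very close to it, you should be able to create 201601 as well.

Once you have done that, you can pass it to the query the same way you pass 2016-01-01, and your problem should be solved.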
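One possible way to apply this idea inside the query itself is to derive the partition value from the date literal rather than from the ActvyDt column. The sketch below is an assumption-laden illustration: the variable names start_dt and end_dt are hypothetical, both dates are assumed to fall in the same month, and whether the expression is folded to a constant before partition pruning may depend on the Hive version, so it is worth confirming with EXPLAIN DEPENDENCY.

```sql
-- Assumption: the date range arrives as variables; the names are hypothetical.
SET hivevar:start_dt=2016-01-01;
SET hivevar:end_dt=2016-01-09;

-- The right-hand side of the yyyymm predicate uses only a string literal,
-- not a table column, so it does not depend on any row being read.
-- Assumes start_dt and end_dt are in the same month.
SELECT actvydt, yyyymm
FROM pos
WHERE yyyymm = cast(concat(substr('${hivevar:start_dt}', 1, 4),
                           substr('${hivevar:start_dt}', 6, 2)) AS int)
  AND actvydt >= '${hivevar:start_dt}'
  AND actvydt <= '${hivevar:end_dt}'
LIMIT 10;
```

If the range can span several months, computing the list of yyyymm values outside Hive and passing it in an IN (...) predicate would follow the same principle.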

mysql

hadoop

hive

hiveql

cloudera-cdh