How to insert overwrite partitions only if the partitions do not exist in Hive?
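As the title says. I am working on a job that always needs to overwrite Hive tables. The table has multiple partitions, and when I rerun my code after making changes, I only want to insert new partitions without changing the existing ones.
You can left join the source to the list of existing partitions and keep only the rows where the joined key is NULL (that is, the rows that did not join). Because a dynamic-partition insert overwrite only replaces the partitions that appear in the query output, the partitions that are filtered out stay untouched. NOT EXISTS also works and generates the same plan as a left join in Hive. The left join version looks like this: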
insert overwrite table target_table partition (partition_key)
select col1, ... coln, s.partition_key
from source s
left join (select distinct partition_key --existing partitions
from target_table
) t on s.partition_key=t.partition_key
where t.partition_key is NULL; --keep only partitions that do not exist in the target yet
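The NOT EXISTS form mentioned above is not spelled out, so here is a minimal sketch of what it might look like. The names target_table, source, col1, col2 and partition_key are placeholders rather than names from a real schema, and the two `set` lines assume dynamic partitioning is not already enabled in the session.

```sql
-- Sketch only: target_table, source, col1, col2 and partition_key are placeholder names.
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table target_table partition (partition_key)
select s.col1, s.col2, s.partition_key
from source s
where not exists (
    -- correlated on the partition key: a source row is dropped
    -- as soon as its partition is already present in the target
    select t.partition_key
    from target_table t
    where t.partition_key = s.partition_key
);
```

Which form to use is mostly a matter of readability, since both express the same anti-join.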
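One option is to left join the source dataset with the distinct partition columns of the target table, using the partition columns as the join key, and filter out the partitions they have in common. Your Hive query should look like this: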
insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
src.column1,
src.column2,
....,
src.columnN,
src.partition_column1,
src.partition_column2,
....,
src.partition_columnN
from
source src
left join
(
select distinct
partition_column1,
partition_column2,
....,
partition_columnN
from
target_table
)
tgt
on src.partition_column1 = tgt.partition_column1
and src.partition_column2 = tgt.partition_column2
...
and src.partition_columnN = tgt.partition_columnN
where
tgt.partition_column1 is null
or tgt.partition_column2 is null
...
or tgt.partition_columnN is null;
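A simple demonstration of this logic is given below:
Let us create two tables named orders and orders_source. The orders table will be the target table and orders_source will be the source table. For simplicity, I have used the same schema for both tables.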
CREATE TABLE `orders`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
CREATE TABLE `orders_source`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
Next, insert some sample records to test the logic:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table orders partition (state, order_date)
select
orde.id,
orde.customer_id,
orde.shipper_id,
orde.state,
orde.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10241 as id,
20481 as customer_id,
30721 as shipper_id,
'GA' as state,
'2019-09-01' as order_date
)
orde;
insert overwrite table orders_source partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10242 as id,
20482 as customer_id,
30722 as shipper_id,
'CA' as state,
'2019-09-02' as order_date
union all
select
10243 as id,
20483 as customer_id,
30723 as shipper_id,
'FL' as state,
'2019-09-02' as order_date
union all
select
10244 as id,
20484 as customer_id,
30724 as shipper_id,
'TX' as state,
'2019-09-02' as order_date
)
orso;
Now, let us check the data inserted into both tables before running the actual business logic:
hive (default)> select * from orders_source;
OK
orders_source.id orders_source.customer_id orders_source.shipper_id orders_source.state orders_source.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10241 20481 30721 GA 2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)
Next, run the logic that selects from the source table the records that will be inserted into the target table. You can run the following query:
hive (default)> select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
OK
orso.id orso.customer_id orso.shipper_id orso.state orso.order_date
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
10242 20482 30722 CA 2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)
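As the result above shows, only the rows whose (state, order_date) partition does not yet exist in orders are returned; the CA / 2019-09-01 row is filtered out because that partition is already in the target.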
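On larger tables it may be easier to preview the partitions that would be written rather than the individual rows. The following is a rough sketch against the demo tables; the row_count alias is only illustrative. It tests a single join column for NULL, which is enough here because a row that matched the join always carries non-null keys.

```sql
-- Preview: one line per partition that the insert would create in orders.
select
    orso.state,
    orso.order_date,
    count(*) as row_count
from
    orders_source orso
left join
    (
        select distinct
            state,
            order_date
        from
            orders
    )
    orde
    on orso.state = orde.state
    and orso.order_date = orde.order_date
where
    orde.state is null
group by
    orso.state,
    orso.order_date;
```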
Finally, insert the records into the target table by issuing the following query:
insert overwrite table orders partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
Now, let us verify the data in the target table after the insert operation.
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10241 20481 30721 GA 2019-09-01
10244 20484 30724 TX 2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)
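As an extra check, listing the partitions of the target table can confirm that the existing ones are still there. This is a small sketch using the demo table; `show partitions` reads the metastore, so it does not scan the data.

```sql
-- List every partition of the demo target table.
show partitions orders;

-- Optionally narrow the listing with a partial partition spec.
show partitions orders partition (state = 'CA');
```

For the demo data, the two partitions that existed before the insert (CA and GA on 2019-09-01) should still be listed alongside the three new ones.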
That's it. All done!