为什么 Hive Insert 中需要 Reducer
Why Reducer required in a Hive Insert
当我们从配置单元命令行触发插入语句时,问题与 MapReduce 作业的工作有关。
将记录插入配置单元时table:由于插入内部配置单元时不涉及聚合table,为什么还会调用reducer。它应该只是一个映射器作业。
这里reducer的作用是什么
insert into table values (1,1);
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2021-04-28 10:30:26,487 Stage-1 map = 0%, reduce = 0%
INFO : 2021-04-28 10:30:30,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.96 sec
INFO : 2021-04-28 10:30:36,774 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.35 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 350 msec
hive> set hive.merge.mapfiles;
hive.merge.mapfiles=true
hive> set hive.merge.mapredfiles;
hive.merge.mapredfiles=false
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
explain insert into test values (10,14);
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-2 depends on stages: Stage-0
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: _dummy_table
Row Limit Per Split: 1
Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: array(const struct(10,14)) (type: array<struct<col1:int,col2:int>>)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
UDTF Operator
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
function name: inline
Select Operator
expressions: col1 (type: int), col2 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Select Operator
expressions: _col0 (type: int), _col1 (type: int)
outputColumnNames: i, j
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: compute_stats(i, 'hll'), compute_stats(j, 'hll')
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>), _col1 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>)
Reduce Operator Tree:
Group By Operator
aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-7
Conditional Operator
Stage: Stage-4
Move Operator
files:
hdfs directory: true
destination:<path>
Stage: Stage-0
Move Operator
tables:
replace: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Stats Work
Basic Stats Work:
Column Stats Desc:
Columns: i, j
Column Types: int, int
Table: db.test.test
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-6
Move Operator
files:
hdfs directory: true
destination: <PATH>
Time taken: 5.123 seconds, Fetched: 121 row(s)
您似乎启用了统计信息自动收集功能:
SET hive.stats.autogather=true;
并且reducer正在计算统计信息
Reduce Operator Tree:
Group By Operator
aggregations: **compute_stats**(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial
当我们从配置单元命令行触发插入语句时,问题与 MapReduce 作业的工作有关。 将记录插入配置单元时table:由于插入内部配置单元时不涉及聚合table,为什么还会调用reducer。它应该只是一个映射器作业。 这里reducer的作用是什么
insert into table values (1,1);
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2021-04-28 10:30:26,487 Stage-1 map = 0%, reduce = 0%
INFO : 2021-04-28 10:30:30,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.96 sec
INFO : 2021-04-28 10:30:36,774 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.35 sec
INFO : MapReduce Total cumulative CPU time: 3 seconds 350 msec
hive> set hive.merge.mapfiles;
hive.merge.mapfiles=true
hive> set hive.merge.mapredfiles;
hive.merge.mapredfiles=false
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
explain insert into test values (10,14);
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
Stage-4
Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
Stage-2 depends on stages: Stage-0
Stage-3
Stage-5
Stage-6 depends on stages: Stage-5
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: _dummy_table
Row Limit Per Split: 1
Statistics: Num rows: 1 Data size: 10 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: array(const struct(10,14)) (type: array<struct<col1:int,col2:int>>)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
UDTF Operator
Statistics: Num rows: 1 Data size: 48 Basic stats: COMPLETE Column stats: COMPLETE
function name: inline
Select Operator
expressions: col1 (type: int), col2 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Select Operator
expressions: _col0 (type: int), _col1 (type: int)
outputColumnNames: i, j
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: compute_stats(i, 'hll'), compute_stats(j, 'hll')
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 848 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>), _col1 (type: struct<columntype:string,min:bigint,max:bigint,countnulls:bigint,bitvector:binary>)
Reduce Operator Tree:
Group By Operator
aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 880 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-7
Conditional Operator
Stage: Stage-4
Move Operator
files:
hdfs directory: true
destination:<path>
Stage: Stage-0
Move Operator
tables:
replace: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Stats Work
Basic Stats Work:
Column Stats Desc:
Columns: i, j
Column Types: int, int
Table: db.test.test
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: db.test
Stage: Stage-6
Move Operator
files:
hdfs directory: true
destination: <PATH>
Time taken: 5.123 seconds, Fetched: 121 row(s)
您似乎启用了统计信息自动收集功能:
SET hive.stats.autogather=true;
并且reducer正在计算统计信息
Reduce Operator Tree:
Group By Operator
aggregations: **compute_stats**(VALUE._col0), compute_stats(VALUE._col1)
mode: mergepartial