Hive fetching query results from HDFS is too slow because there are too many map-only tasks. How can I merge the query result files when a Hive SQL query is executed?
The Hive query produces far too many result files under the "/tmp/hive/hive" folder, close to 40,000 tasks, yet the final result set contains only a little over 100 rows.
So I would like to know whether there is a way to merge the results after the query, reduce the number of result files, and make fetching the results more efficient.
Here is the EXPLAIN output for the query:
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: kafka_program_log |
| filterExpr: ((msg like '%disk loss%') and (ds > '2022-05-01')) (type: boolean) |
| Statistics: Num rows: 36938084350 Data size: 11081425337136 Basic stats: PARTIAL Column stats: PARTIAL |
| Filter Operator |
| predicate: (msg like '%disk loss%') (type: boolean) |
| Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
| Select Operator |
| expressions: server (type: string), msg (type: string), ts (type: string), ds (type: string), h (type: string) |
| outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
| Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 18469042175 Data size: 5540712668568 Basic stats: COMPLETE Column stats: PARTIAL |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
- Recreate the table using ORC/Parquet and you will get much better performance. This is the first priority for speeding things up (a sketch is shown after the query example below).
- You are using the LIKE operator, which means scanning all of the data. You may want to consider rewriting the query to use a join/where clause instead; that will run much faster. Below is an example of what you can do to make things better.
with words as -- shortcut for a readable sub-query
(
select
  log.msg
from
  kafka_program_log log
  lateral view EXPLODE(split(msg, ' ')) w as word -- for each word in msg, make a row; assumes 'disk loss' appears in the msg
where
  word in ('disk', 'loss') -- filter the words down to the ones we care about.
and
  ds > '2022-05-01' -- filter dates down to the ones we care about.
group by
  log.msg -- gather the msgs together
having
  count(word) >= 2 -- only pull back msgs that contain at least two of the words we are interested in.
) -- end sub-query
select
  *
from kafka_program_log log
inner join words
  on words.msg = log.msg -- this join should really reduce the data we examine
where
  log.msg like '%disk loss%' -- LIKE is fine now to make sure it's exactly what we're looking for.
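As for the first point, here is a minimal sketch of recreating the table in ORC. The column list and the ds/h partition columns are assumptions taken from the EXPLAIN plan above, and kafka_program_log_orc is a hypothetical name for the new table; adjust both to the real table definition.

-- Hypothetical ORC copy of the table; columns and partitions assumed from the EXPLAIN plan above.
create table kafka_program_log_orc (
  server string,
  msg    string,
  ts     string
)
partitioned by (ds string, h string)
stored as orc;

-- Dynamic partitioning is needed to copy all partitions in a single insert.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table kafka_program_log_orc partition (ds, h)
select server, msg, ts, ds, h from kafka_program_log;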
Set mapred.max.split.size=2560000000;
This increases the amount of data processed by a single map task and therefore reduces the number of map tasks.
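A minimal sketch of how this could be applied in the session before running the query. CombineHiveInputFormat (which combines small input files into larger splits, and is typically already the default input format) is an assumption here; the 2560000000-byte value is the one suggested above.

-- Combine small input files into larger splits so fewer map tasks are launched.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=2560000000;  -- max bytes per split, as suggested above

-- Then run the query as usual; fewer map tasks also means fewer result files to fetch.
select server, msg, ts, ds, h
from kafka_program_log
where msg like '%disk loss%'
  and ds > '2022-05-01';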