Hive CLI 和 Beeline jdbc:hive2 在插入百万条记录的执行引擎 tez 中表现不同?
Hive CLI and Beeline jdbc:hive2 behave differently in execution engine tez for insert million records?
从具有数百万条记录(20GB 大小)的大型 table 执行插入到空 table 时。在 hive CLI 和 beeline 中执行不同。
Hive CLI:它在 Yarn 中创建两个 TEZ 作业,可能是 mapper 和 reducer,并在大约 413 秒内完成。
Beeline:它在 Yarn 中创建第一个 TEZ 作业,其他是 MapReduce 作业,超过 150 个作业,耗时将近 2 小时。
这是 TEZ 作业的 hiveserver2 直线的预期行为,因为它在内部创建 MapReduce 作业?
环境详情:
- 蜂巢版本:2.1.1
- Tez 版本:0.8.5
hive 常用设置:
- hive.execution.engine=tez
- hive.mv.files.thread=0
直线设置:
- tez.am.resource.memory.mb=20000
- mapreduce.map.memory.mb=20000
- hive.vectorized.execution.reduce.enabled=假;
提前致谢。
**Hive CLI log:**
`2018-08-07T18:22:56,482 INFO [main] exec.Task: Dag name: insert into default.t...db.temp_large_table3(Stage-1)
2018-08-07T18:22:56,493 INFO [main] ql.Context: New scratch dir is hdfs://edhcluster/tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1
2018-08-07T18:22:56,532 INFO [main] tez.DagUtils: Vertex has custom input? false
2018-08-07T18:22:56,589 INFO [main] exec.SerializationUtilities: Serializing MapWork using kryo
2018-08-07T18:22:56,601 INFO [main] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1/hive/_tez_scratch_dir/d5cc1718-38b1-49ba-a97e-ab9f78415b62/map.xml
2018-08-07T18:22:56,669 INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
2018-08-07T18:22:56,686 INFO [main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hive_20180807182253_52487095-48c1-4847-92cd-6e60121e8ae2 }
2018-08-07T18:22:57,206 INFO [main] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagId=dag_1533623337748_0376_1, dagName=insert into default.t...db.temp_large_table3(Stage-1)
2018-08-07T18:22:57,277 INFO [main] SessionState:
2018-08-07T18:22:57,719 INFO [main] SessionState: Status: Running (Executing on YARN cluster with App id application_1533623337748_0376)
2018-08-07T18:22:57,721 INFO [main] SessionState: Map 1: 0/165
2018-08-07T18:23:00,542 INFO [main] SessionState: Map 1: 0(+1)/165
2018-08-07T18:23:01,551 INFO [main] SessionState: Map 1: 0(+2)/165
:
:
2018-08-07T18:30:01,688 INFO [main] SessionState: Map 1: 165/165
2018-08-07T18:30:01,713 INFO [main] counters.Limits: Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200
2018-08-07T18:30:01,726 INFO [main] exec.FileSinkOperator: Moving tmp dir: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/_tmp.-ext-10000 to: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10000
2018-08-07T18:30:01,796 INFO [main] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
2018-08-07T18:30:01,796 INFO [main] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
2018-08-07T18:30:01,797 INFO [main] exec.Task: Loading data to table default.temp_tro1 from hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10000
2018-08-07T18:30:11,683 WARN [main] serde2.AbstractEncodingAwareSerDe: The data may not be properly converted to target charset ISO-8859-1
2018-08-07T18:30:11,759 INFO [main] ql.Driver: Starting task [Stage-3:STATS] in serial mode
2018-08-07T18:30:11,759 INFO [main] exec.StatsTask: Executing stats task
2018-08-07T18:30:11,891 INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
2018-08-07T18:30:11,985 INFO [main] fs.FSStatsAggregator: Read stats : {default.temp_tro1/={rawDataSize=133428373, numRows=789517}}
2018-08-07T18:30:12,003 INFO [main] fs.FSStatsAggregator: Read stats : {default.temp_tro1/={rawDataSize=133428204, numRows=789516}}
`
====
**Beeline log:**
`2018-08-07T16:29:13,856 INFO [HiveServer2-Background-Pool: Thread-1549] ql.Context: New scratch dir is hdfs://edhcluster/tmp/hive/scratch/hive/0887b266-675a
-4fb2-8c85-3a27ebb3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3
2018-08-07T16:29:13,900 INFO [HiveServer2-Background-Pool: Thread-1549] tez.DagUtils: Vertex has custom input? false
2018-08-07T16:29:13,901 INFO [HiveServer2-Background-Pool: Thread-1549] exec.SerializationUtilities: Serializing MapWork using kryo
2018-08-07T16:29:13,903 INFO [HiveServer2-Background-Pool: Thread-1549] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/0887b266-675a-4fb2-8c85-3a27ebb
3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3/hive/_tez_scratch_dir/6f4620d8-310c-4aff-bbe8-6f69ea9d1341/map.xml
2018-08-07T16:29:13,934 INFO [HiveServer2-Background-Pool: Thread-1549] fs.FSStatsPublisher: created : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16
-29-12_750_8973639287951385407-1/-ext-10001
2018-08-07T16:29:13,938 INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95-
946d-30557075f807, applicationId=application_1533623337748_0148, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE,
callerType=HIVE_QUERY_ID, callerId=hive_20180807162912_519c1503-c151-4da7-b5a2-bd067e9c42b9 }
2018-08-07T16:29:14,127 INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95-9
46d-30557075f807, applicationId=application_1533623337748_0148, dagId=dag_1533623337748_0148_2, dagName=insert into default.t...db.temp_large_table3(Stage-1)
2018-08-07T16:29:14,185 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState:
2018-08-07T16:29:14,390 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Status: Running (Executing on YARN cluster with App id application_153
3623337748_0148)
2018-08-07T16:29:14,390 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 0/171
2018-08-07T16:29:16,600 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 0(+3)/171
:
:
2018-08-07T16:35:57,955 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 171/171
2018-08-07T16:35:57,963 INFO [HiveServer2-Background-Pool: Thread-1549] exec.FileSinkOperator: Moving tmp dir: hdfs://edhcluster/tmp/hive/staging_hive_2018-
08-07_16-29-12_750_8973639287951385407-1/_tmp.-ext-10000 to: hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000
2018-08-07T16:35:57,996 INFO [HiveServer2-Background-Pool: Thread-1549] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
2018-08-07T16:35:57,996 INFO [HiveServer2-Background-Pool: Thread-1549] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
2018-08-07T16:35:57,996 INFO [HiveServer2-Background-Pool: Thread-1549] exec.Task: Loading data to table default.temp_tro from hdfs://edhcluster/tmp/hive/st
aging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000
2018-08-07T16:35:58,158 INFO [HiveServer2-Background-Pool: Thread-1549] metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-
29-12_750_8973639287951385407-1/-ext-10000/000000_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000000_0 because HDFS encryption zones are different.
2018-08-07T16:35:58,158 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Source is 129368810 bytes. (MAX: 33554432)
2018-08-07T16:35:58,158 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Launch distributed copy (distcp) job.
2018-08-07T16:35:58,239 INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, dele
teMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://edhcl
uster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000000_0], targetPath=hdfs://edhcluster/user/hive/warehouse/temp_tro/000
000_0, targetPathExists=false, preserveRawXattrs=false}
2018-08-07T16:35:58,288 INFO [HiveServer2-Background-Pool: Thread-1549] hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 10139 for hive on ha-hdfs:edhclu
ster
2018-08-07T16:35:58,295 INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: HDFS_DELEGATION_TOKEN, Serv
ice: ha-hdfs:edhcluster, Ident: (HDFS_DELEGATION_TOKEN token 10139 for hive)
2018-08-07T16:35:58,295 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:58,295 INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: kms-dt, Service: 160.88.112
.163:9393, Ident: 00 04 68 69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4a d6 8a 01 65 37 93 ce d6 8e 1d f6 24
2018-08-07T16:35:58,738 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:58,738 INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: kms-dt, Service: 160.88.112
.162:9393, Ident: 00 04 68 69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4c 91 8a 01 65 37 93 d0 91 8e 1d 49 23
2018-08-07T16:35:58,745 WARN [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Imp
lement the Tool interface and execute your application with ToolRunner to remedy this.
2018-08-07T16:35:59,000 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:35:59,000 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:35:59,130 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: number of splits:1
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Submitting tokens for job: job_1533623337748_0215
2018-08-07T16:35:59,313 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: kms-dt, Service: 160.88.112.162:9393, Ident: 00 04 68
69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4c 91 8a 01 65 37 93 d0 91 8e 1d 49 23
2018-08-07T16:35:59,313 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: kms-dt, Service: 160.88.112.163:9393, Ident: 00 04 68
69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4a d6 8a 01 65 37 93 ce d6 8e 1d f6 24
2018-08-07T16:35:59,313 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind HIVE_DELEGATION_TOKEN
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2Impersonati
onToken, Ident: 00 04 68 69 76 65 04 68 69 76 65 00 8a 01 65 13 76 fe e0 8a 01 65 37 83 82 e0 01 8e 02 73
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:edhcluster, Id
ent: (HDFS_DELEGATION_TOKEN token 10139 for hive)
2018-08-07T16:35:59,520 INFO [HiveServer2-Background-Pool: Thread-1549] impl.YarnClientImpl: Submitted application application_1533623337748_0215
2018-08-07T16:35:59,521 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: The url to track the job: https://bdtst1a.server:8888
/proxy/application_1533623337748_0215/
2018-08-07T16:35:59,521 INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: DistCp job-id: job_1533623337748_0215
2018-08-07T16:35:59,521 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Running job: job_1533623337748_0215
2018-08-07T16:36:04,002 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:04,002 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:06,571 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Job job_1533623337748_0215 running in uber mode : false
2018-08-07T16:36:06,571 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: map 0% reduce 0%
2018-08-07T16:36:09,004 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:09,004 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:14,006 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:14,006 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:18,603 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: map 100% reduce 0%
2018-08-07T16:36:19,007 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:19,007 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:24,009 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:24,009 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:29,010 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:29,010 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:34,012 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:34,012 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:37,641 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Job job_1533623337748_0215 completed successfully
2018-08-07T16:36:37,664 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=297867
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=129369232
HDFS: Number of bytes written=129368810
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=85701
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=28567
Total vcore-milliseconds taken by all map tasks=28567
Total megabyte-milliseconds taken by all map tasks=571340000
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=134
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=84
CPU time spent (ms)=31920
Physical memory (bytes) snapshot=542441472
Virtual memory (bytes) snapshot=19353210880
Total committed heap usage (bytes)=826277888
File Input Format Counters
Bytes Read=288
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=129368810
BYTESEXPECTED=129368810
COPY=1
2018-08-07T16:36:37,706 INFO [HiveServer2-Background-Pool: Thread-1549] metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-
29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
2018-08-07T16:36:37,706 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Source is 129368980 bytes. (MAX: 33554432)
2018-08-07T16:36:37,706 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Launch distributed copy (distcp) job.
2018-08-07T16:36:37,783 INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, dele
teMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://edhcl
uster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000001_0], targetPath=hdfs://edhcluster/user/hive/warehouse/temp_tro/000
001_0, targetPathExists=false, preserveRawXattrs=false}
`
:
:
更新:
发现HIVE CLI的hive用户使用了.hiverc,所以发现了差异。
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
问题是 hdfs /user/hive 目录是用 Ranger 加密的,而 hdfs /tmp/hive 目录是未加密的,hadoop 组中的所有用户都可以 read/write。
我测试了会话级别更改的直线。执行速度与 HIVE CLI 一样快。
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
我使用会话级别更改的 HIVE CLI 进行了测试。使用 MAP reduce 作业移动数据时执行速度很慢。
hive.exec.scratchdir=/tmp/hive/scratch
hive.exec.stagingdir=/tmp/hive/staging
所以根本原因是数据在 /user/hive 中加密而不是在 /tmp/hive 中加密。
解决方案是更改会话级别以使用相同的加密区域。
因此如果加密区域不同,将打印以下信息日志。
metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16- 29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
谢谢,
曼吉尔
从具有数百万条记录(20GB 大小)的大型 table 执行插入到空 table 时。在 hive CLI 和 beeline 中执行不同。
Hive CLI:它在 Yarn 中创建两个 TEZ 作业,可能是 mapper 和 reducer,并在大约 413 秒内完成。
Beeline:它在 Yarn 中创建第一个 TEZ 作业,其他是 MapReduce 作业,超过 150 个作业,耗时将近 2 小时。
这是 TEZ 作业的 hiveserver2 直线的预期行为,因为它在内部创建 MapReduce 作业?
环境详情:
- 蜂巢版本:2.1.1
- Tez 版本:0.8.5
hive 常用设置:
- hive.execution.engine=tez
- hive.mv.files.thread=0
直线设置:
- tez.am.resource.memory.mb=20000
- mapreduce.map.memory.mb=20000
- hive.vectorized.execution.reduce.enabled=假;
提前致谢。
**Hive CLI log:**
`2018-08-07T18:22:56,482 INFO [main] exec.Task: Dag name: insert into default.t...db.temp_large_table3(Stage-1)
2018-08-07T18:22:56,493 INFO [main] ql.Context: New scratch dir is hdfs://edhcluster/tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1
2018-08-07T18:22:56,532 INFO [main] tez.DagUtils: Vertex has custom input? false
2018-08-07T18:22:56,589 INFO [main] exec.SerializationUtilities: Serializing MapWork using kryo
2018-08-07T18:22:56,601 INFO [main] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1/hive/_tez_scratch_dir/d5cc1718-38b1-49ba-a97e-ab9f78415b62/map.xml
2018-08-07T18:22:56,669 INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
2018-08-07T18:22:56,686 INFO [main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hive_20180807182253_52487095-48c1-4847-92cd-6e60121e8ae2 }
2018-08-07T18:22:57,206 INFO [main] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagId=dag_1533623337748_0376_1, dagName=insert into default.t...db.temp_large_table3(Stage-1)
2018-08-07T18:22:57,277 INFO [main] SessionState:
2018-08-07T18:22:57,719 INFO [main] SessionState: Status: Running (Executing on YARN cluster with App id application_1533623337748_0376)
2018-08-07T18:22:57,721 INFO [main] SessionState: Map 1: 0/165
2018-08-07T18:23:00,542 INFO [main] SessionState: Map 1: 0(+1)/165
2018-08-07T18:23:01,551 INFO [main] SessionState: Map 1: 0(+2)/165
:
:
2018-08-07T18:30:01,688 INFO [main] SessionState: Map 1: 165/165
2018-08-07T18:30:01,713 INFO [main] counters.Limits: Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200
2018-08-07T18:30:01,726 INFO [main] exec.FileSinkOperator: Moving tmp dir: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/_tmp.-ext-10000 to: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10000
2018-08-07T18:30:01,796 INFO [main] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
2018-08-07T18:30:01,796 INFO [main] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
2018-08-07T18:30:01,797 INFO [main] exec.Task: Loading data to table default.temp_tro1 from hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10000
2018-08-07T18:30:11,683 WARN [main] serde2.AbstractEncodingAwareSerDe: The data may not be properly converted to target charset ISO-8859-1
2018-08-07T18:30:11,759 INFO [main] ql.Driver: Starting task [Stage-3:STATS] in serial mode
2018-08-07T18:30:11,759 INFO [main] exec.StatsTask: Executing stats task
2018-08-07T18:30:11,891 INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
2018-08-07T18:30:11,985 INFO [main] fs.FSStatsAggregator: Read stats : {default.temp_tro1/={rawDataSize=133428373, numRows=789517}}
2018-08-07T18:30:12,003 INFO [main] fs.FSStatsAggregator: Read stats : {default.temp_tro1/={rawDataSize=133428204, numRows=789516}}
`
====
**Beeline log:**
`2018-08-07T16:29:13,856 INFO [HiveServer2-Background-Pool: Thread-1549] ql.Context: New scratch dir is hdfs://edhcluster/tmp/hive/scratch/hive/0887b266-675a
-4fb2-8c85-3a27ebb3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3
2018-08-07T16:29:13,900 INFO [HiveServer2-Background-Pool: Thread-1549] tez.DagUtils: Vertex has custom input? false
2018-08-07T16:29:13,901 INFO [HiveServer2-Background-Pool: Thread-1549] exec.SerializationUtilities: Serializing MapWork using kryo
2018-08-07T16:29:13,903 INFO [HiveServer2-Background-Pool: Thread-1549] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/0887b266-675a-4fb2-8c85-3a27ebb
3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3/hive/_tez_scratch_dir/6f4620d8-310c-4aff-bbe8-6f69ea9d1341/map.xml
2018-08-07T16:29:13,934 INFO [HiveServer2-Background-Pool: Thread-1549] fs.FSStatsPublisher: created : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16
-29-12_750_8973639287951385407-1/-ext-10001
2018-08-07T16:29:13,938 INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95-
946d-30557075f807, applicationId=application_1533623337748_0148, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE,
callerType=HIVE_QUERY_ID, callerId=hive_20180807162912_519c1503-c151-4da7-b5a2-bd067e9c42b9 }
2018-08-07T16:29:14,127 INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95-9
46d-30557075f807, applicationId=application_1533623337748_0148, dagId=dag_1533623337748_0148_2, dagName=insert into default.t...db.temp_large_table3(Stage-1)
2018-08-07T16:29:14,185 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState:
2018-08-07T16:29:14,390 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Status: Running (Executing on YARN cluster with App id application_153
3623337748_0148)
2018-08-07T16:29:14,390 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 0/171
2018-08-07T16:29:16,600 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 0(+3)/171
:
:
2018-08-07T16:35:57,955 INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 171/171
2018-08-07T16:35:57,963 INFO [HiveServer2-Background-Pool: Thread-1549] exec.FileSinkOperator: Moving tmp dir: hdfs://edhcluster/tmp/hive/staging_hive_2018-
08-07_16-29-12_750_8973639287951385407-1/_tmp.-ext-10000 to: hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000
2018-08-07T16:35:57,996 INFO [HiveServer2-Background-Pool: Thread-1549] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
2018-08-07T16:35:57,996 INFO [HiveServer2-Background-Pool: Thread-1549] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
2018-08-07T16:35:57,996 INFO [HiveServer2-Background-Pool: Thread-1549] exec.Task: Loading data to table default.temp_tro from hdfs://edhcluster/tmp/hive/st
aging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000
2018-08-07T16:35:58,158 INFO [HiveServer2-Background-Pool: Thread-1549] metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-
29-12_750_8973639287951385407-1/-ext-10000/000000_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000000_0 because HDFS encryption zones are different.
2018-08-07T16:35:58,158 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Source is 129368810 bytes. (MAX: 33554432)
2018-08-07T16:35:58,158 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Launch distributed copy (distcp) job.
2018-08-07T16:35:58,239 INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, dele
teMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://edhcl
uster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000000_0], targetPath=hdfs://edhcluster/user/hive/warehouse/temp_tro/000
000_0, targetPathExists=false, preserveRawXattrs=false}
2018-08-07T16:35:58,288 INFO [HiveServer2-Background-Pool: Thread-1549] hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 10139 for hive on ha-hdfs:edhclu
ster
2018-08-07T16:35:58,295 INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: HDFS_DELEGATION_TOKEN, Serv
ice: ha-hdfs:edhcluster, Ident: (HDFS_DELEGATION_TOKEN token 10139 for hive)
2018-08-07T16:35:58,295 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:58,295 INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: kms-dt, Service: 160.88.112
.163:9393, Ident: 00 04 68 69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4a d6 8a 01 65 37 93 ce d6 8e 1d f6 24
2018-08-07T16:35:58,738 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:58,738 INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: kms-dt, Service: 160.88.112
.162:9393, Ident: 00 04 68 69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4c 91 8a 01 65 37 93 d0 91 8e 1d 49 23
2018-08-07T16:35:58,745 WARN [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Imp
lement the Tool interface and execute your application with ToolRunner to remedy this.
2018-08-07T16:35:59,000 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:35:59,000 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:35:59,130 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: number of splits:1
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Submitting tokens for job: job_1533623337748_0215
2018-08-07T16:35:59,313 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: kms-dt, Service: 160.88.112.162:9393, Ident: 00 04 68
69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4c 91 8a 01 65 37 93 d0 91 8e 1d 49 23
2018-08-07T16:35:59,313 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: kms-dt, Service: 160.88.112.163:9393, Ident: 00 04 68
69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4a d6 8a 01 65 37 93 ce d6 8e 1d f6 24
2018-08-07T16:35:59,313 WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind HIVE_DELEGATION_TOKEN
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2Impersonati
onToken, Ident: 00 04 68 69 76 65 04 68 69 76 65 00 8a 01 65 13 76 fe e0 8a 01 65 37 83 82 e0 01 8e 02 73
2018-08-07T16:35:59,313 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:edhcluster, Id
ent: (HDFS_DELEGATION_TOKEN token 10139 for hive)
2018-08-07T16:35:59,520 INFO [HiveServer2-Background-Pool: Thread-1549] impl.YarnClientImpl: Submitted application application_1533623337748_0215
2018-08-07T16:35:59,521 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: The url to track the job: https://bdtst1a.server:8888
/proxy/application_1533623337748_0215/
2018-08-07T16:35:59,521 INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: DistCp job-id: job_1533623337748_0215
2018-08-07T16:35:59,521 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Running job: job_1533623337748_0215
2018-08-07T16:36:04,002 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:04,002 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:06,571 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Job job_1533623337748_0215 running in uber mode : false
2018-08-07T16:36:06,571 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: map 0% reduce 0%
2018-08-07T16:36:09,004 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:09,004 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:14,006 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:14,006 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:18,603 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: map 100% reduce 0%
2018-08-07T16:36:19,007 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:19,007 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:24,009 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:24,009 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:29,010 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:29,010 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:34,012 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
rver2-Handler-Pool: Thread-53
2018-08-07T16:36:34,012 INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to HiveServer2-Handler-Pool: Thread-53
2018-08-07T16:36:37,641 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Job job_1533623337748_0215 completed successfully
2018-08-07T16:36:37,664 INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=297867
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=129369232
HDFS: Number of bytes written=129368810
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=85701
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=28567
Total vcore-milliseconds taken by all map tasks=28567
Total megabyte-milliseconds taken by all map tasks=571340000
Map-Reduce Framework
Map input records=1
Map output records=0
Input split bytes=134
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=84
CPU time spent (ms)=31920
Physical memory (bytes) snapshot=542441472
Virtual memory (bytes) snapshot=19353210880
Total committed heap usage (bytes)=826277888
File Input Format Counters
Bytes Read=288
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=129368810
BYTESEXPECTED=129368810
COPY=1
2018-08-07T16:36:37,706 INFO [HiveServer2-Background-Pool: Thread-1549] metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-
29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
2018-08-07T16:36:37,706 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Source is 129368980 bytes. (MAX: 33554432)
2018-08-07T16:36:37,706 INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Launch distributed copy (distcp) job.
2018-08-07T16:36:37,783 INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, dele
teMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://edhcl
uster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000001_0], targetPath=hdfs://edhcluster/user/hive/warehouse/temp_tro/000
001_0, targetPathExists=false, preserveRawXattrs=false}
`
:
:
更新:
发现HIVE CLI的hive用户使用了.hiverc,所以发现了差异。
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
问题是 hdfs /user/hive 目录是用 Ranger 加密的,而 hdfs /tmp/hive 目录是未加密的,hadoop 组中的所有用户都可以 read/write。
我测试了会话级别更改的直线。执行速度与 HIVE CLI 一样快。
hive.exec.scratchdir=/user/hive/scratch
hive.exec.stagingdir=/user/hive/staging
我使用会话级别更改的 HIVE CLI 进行了测试。使用 MAP reduce 作业移动数据时执行速度很慢。
hive.exec.scratchdir=/tmp/hive/scratch
hive.exec.stagingdir=/tmp/hive/staging
所以根本原因是数据在 /user/hive 中加密而不是在 /tmp/hive 中加密。
解决方案是更改会话级别以使用相同的加密区域。
因此如果加密区域不同,将打印以下信息日志。
metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16- 29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
谢谢,
曼吉尔