Hive CLI and Beeline (jdbc:hive2) behave differently on the Tez execution engine when inserting millions of records?

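When I run an insert from a large table with millions of records (about 20 GB) into an empty table, Hive CLI and Beeline behave differently.

Hive CLI: it creates two Tez jobs in YARN, possibly a mapper and a reducer, and finishes in about 413 seconds.

Beeline: the first job it creates in YARN is a Tez job and the rest are MapReduce jobs, more than 150 in total, and the whole run takes almost 2 hours.

Is this expected behavior for a Tez job submitted through HiveServer2 and Beeline, given that it creates MapReduce jobs internally?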

Environment details:

Common Hive settings:

Beeline settings:

Thanks in advance.

    **Hive CLI log:**

    `2018-08-07T18:22:56,482  INFO [main] exec.Task: Dag name: insert into default.t...db.temp_large_table3(Stage-1)
    2018-08-07T18:22:56,493  INFO [main] ql.Context: New scratch dir is hdfs://edhcluster/tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1
    2018-08-07T18:22:56,532  INFO [main] tez.DagUtils: Vertex has custom input? false
    2018-08-07T18:22:56,589  INFO [main] exec.SerializationUtilities: Serializing MapWork using kryo
    2018-08-07T18:22:56,601  INFO [main] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/a501276d-2015-435b-85c5-4d40534ac162/hive_2018-08-07_18-22-53_167_2618699013418541798-1/hive/_tez_scratch_dir/d5cc1718-38b1-49ba-a97e-ab9f78415b62/map.xml
    2018-08-07T18:22:56,669  INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
    2018-08-07T18:22:56,686  INFO [main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hive_20180807182253_52487095-48c1-4847-92cd-6e60121e8ae2 }
    2018-08-07T18:22:57,206  INFO [main] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-a501276d-2015-435b-85c5-4d40534ac162, applicationId=application_1533623337748_0376, dagId=dag_1533623337748_0376_1, dagName=insert into default.t...db.temp_large_table3(Stage-1)
    2018-08-07T18:22:57,277  INFO [main] SessionState:

    2018-08-07T18:22:57,719  INFO [main] SessionState: Status: Running (Executing on YARN cluster with App id application_1533623337748_0376)

    2018-08-07T18:22:57,721  INFO [main] SessionState: Map 1: 0/165
    2018-08-07T18:23:00,542  INFO [main] SessionState: Map 1: 0(+1)/165
    2018-08-07T18:23:01,551  INFO [main] SessionState: Map 1: 0(+2)/165
    :
    :
    2018-08-07T18:30:01,688  INFO [main] SessionState: Map 1: 165/165
    2018-08-07T18:30:01,713  INFO [main] counters.Limits: Counter limits initialized with parameters:  GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200
    2018-08-07T18:30:01,726  INFO [main] exec.FileSinkOperator: Moving tmp dir: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/_tmp.-ext-10000 to: hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10000
    2018-08-07T18:30:01,796  INFO [main] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
    2018-08-07T18:30:01,796  INFO [main] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
    2018-08-07T18:30:01,797  INFO [main] exec.Task: Loading data to table default.temp_tro1 from hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10000
    2018-08-07T18:30:11,683  WARN [main] serde2.AbstractEncodingAwareSerDe: The data may not be properly converted to target charset ISO-8859-1
    2018-08-07T18:30:11,759  INFO [main] ql.Driver: Starting task [Stage-3:STATS] in serial mode
    2018-08-07T18:30:11,759  INFO [main] exec.StatsTask: Executing stats task
    2018-08-07T18:30:11,891  INFO [main] fs.FSStatsPublisher: created : hdfs://edhcluster/user/hive/staging_hive_2018-08-07_18-22-53_167_2618699013418541798-1/-ext-10001
    2018-08-07T18:30:11,985  INFO [main] fs.FSStatsAggregator: Read stats : {default.temp_tro1/={rawDataSize=133428373, numRows=789517}}
    2018-08-07T18:30:12,003  INFO [main] fs.FSStatsAggregator: Read stats : {default.temp_tro1/={rawDataSize=133428204, numRows=789516}}
    `
    ====

    **Beeline log:**

    `2018-08-07T16:29:13,856  INFO [HiveServer2-Background-Pool: Thread-1549] ql.Context: New scratch dir is hdfs://edhcluster/tmp/hive/scratch/hive/0887b266-675a
    -4fb2-8c85-3a27ebb3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3
    2018-08-07T16:29:13,900  INFO [HiveServer2-Background-Pool: Thread-1549] tez.DagUtils: Vertex has custom input? false
    2018-08-07T16:29:13,901  INFO [HiveServer2-Background-Pool: Thread-1549] exec.SerializationUtilities: Serializing MapWork using kryo
    2018-08-07T16:29:13,903  INFO [HiveServer2-Background-Pool: Thread-1549] exec.Utilities: Setting plan: /tmp/hive/scratch/hive/0887b266-675a-4fb2-8c85-3a27ebb
    3b9fc/hive_2018-08-07_16-29-12_750_8973639287951385407-3/hive/_tez_scratch_dir/6f4620d8-310c-4aff-bbe8-6f69ea9d1341/map.xml
    2018-08-07T16:29:13,934  INFO [HiveServer2-Background-Pool: Thread-1549] fs.FSStatsPublisher: created : hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16
    -29-12_750_8973639287951385407-1/-ext-10001
    2018-08-07T16:29:13,938  INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95-
    946d-30557075f807, applicationId=application_1533623337748_0148, dagName=insert into default.t...db.temp_large_table3(Stage-1), callerContext={ context=HIVE,
     callerType=HIVE_QUERY_ID, callerId=hive_20180807162912_519c1503-c151-4da7-b5a2-bd067e9c42b9 }
    2018-08-07T16:29:14,127  INFO [HiveServer2-Background-Pool: Thread-1549] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-e2dfe4df-37f0-4d95-9
    46d-30557075f807, applicationId=application_1533623337748_0148, dagId=dag_1533623337748_0148_2, dagName=insert into default.t...db.temp_large_table3(Stage-1)
    2018-08-07T16:29:14,185  INFO [HiveServer2-Background-Pool: Thread-1549] SessionState:

    2018-08-07T16:29:14,390  INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Status: Running (Executing on YARN cluster with App id application_153
    3623337748_0148)

    2018-08-07T16:29:14,390  INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 0/171
    2018-08-07T16:29:16,600  INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 0(+3)/171
    :
    :
    2018-08-07T16:35:57,955  INFO [HiveServer2-Background-Pool: Thread-1549] SessionState: Map 1: 171/171
    2018-08-07T16:35:57,963  INFO [HiveServer2-Background-Pool: Thread-1549] exec.FileSinkOperator: Moving tmp dir: hdfs://edhcluster/tmp/hive/staging_hive_2018-
    08-07_16-29-12_750_8973639287951385407-1/_tmp.-ext-10000 to: hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000
    2018-08-07T16:35:57,996  INFO [HiveServer2-Background-Pool: Thread-1549] ql.Driver: Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
    2018-08-07T16:35:57,996  INFO [HiveServer2-Background-Pool: Thread-1549] ql.Driver: Starting task [Stage-0:MOVE] in serial mode
    2018-08-07T16:35:57,996  INFO [HiveServer2-Background-Pool: Thread-1549] exec.Task: Loading data to table default.temp_tro from hdfs://edhcluster/tmp/hive/st
    aging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000
    2018-08-07T16:35:58,158  INFO [HiveServer2-Background-Pool: Thread-1549] metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-
    29-12_750_8973639287951385407-1/-ext-10000/000000_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000000_0 because HDFS encryption zones are different.
    2018-08-07T16:35:58,158  INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Source is 129368810 bytes. (MAX: 33554432)
    2018-08-07T16:35:58,158  INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Launch distributed copy (distcp) job.
    2018-08-07T16:35:58,239  INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, dele
    teMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://edhcl
    uster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000000_0], targetPath=hdfs://edhcluster/user/hive/warehouse/temp_tro/000
    000_0, targetPathExists=false, preserveRawXattrs=false}
    2018-08-07T16:35:58,288  INFO [HiveServer2-Background-Pool: Thread-1549] hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 10139 for hive on ha-hdfs:edhclu
    ster
    2018-08-07T16:35:58,295  INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: HDFS_DELEGATION_TOKEN, Serv
    ice: ha-hdfs:edhcluster, Ident: (HDFS_DELEGATION_TOKEN token 10139 for hive)
    2018-08-07T16:35:58,295  WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
    2018-08-07T16:35:58,295  INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: kms-dt, Service: 160.88.112
    .163:9393, Ident: 00 04 68 69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4a d6 8a 01 65 37 93 ce d6 8e 1d f6 24
    2018-08-07T16:35:58,738  WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
    2018-08-07T16:35:58,738  INFO [HiveServer2-Background-Pool: Thread-1549] security.TokenCache: Got dt for hdfs://edhcluster; Kind: kms-dt, Service: 160.88.112
    .162:9393, Ident: 00 04 68 69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4c 91 8a 01 65 37 93 d0 91 8e 1d 49 23
    2018-08-07T16:35:58,745  WARN [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Imp
    lement the Tool interface and execute your application with ToolRunner to remedy this.
    2018-08-07T16:35:59,000  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:35:59,000  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:35:59,130  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: number of splits:1
    2018-08-07T16:35:59,313  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Submitting tokens for job: job_1533623337748_0215
    2018-08-07T16:35:59,313  WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
    2018-08-07T16:35:59,313  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: kms-dt, Service: 160.88.112.162:9393, Ident: 00 04 68
    69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4c 91 8a 01 65 37 93 d0 91 8e 1d 49 23
    2018-08-07T16:35:59,313  WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind kms-dt
    2018-08-07T16:35:59,313  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: kms-dt, Service: 160.88.112.163:9393, Ident: 00 04 68
    69 76 65 04 79 61 72 6e 04 68 69 76 65 8a 01 65 13 87 4a d6 8a 01 65 37 93 ce d6 8e 1d f6 24
    2018-08-07T16:35:59,313  WARN [HiveServer2-Background-Pool: Thread-1549] token.Token: Cannot find class for token kind HIVE_DELEGATION_TOKEN
    2018-08-07T16:35:59,313  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: HIVE_DELEGATION_TOKEN, Service: HiveServer2Impersonati
    onToken, Ident: 00 04 68 69 76 65 04 68 69 76 65 00 8a 01 65 13 76 fe e0 8a 01 65 37 83 82 e0 01 8e 02 73
    2018-08-07T16:35:59,313  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:edhcluster, Id
    ent: (HDFS_DELEGATION_TOKEN token 10139 for hive)
    2018-08-07T16:35:59,520  INFO [HiveServer2-Background-Pool: Thread-1549] impl.YarnClientImpl: Submitted application application_1533623337748_0215
    2018-08-07T16:35:59,521  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: The url to track the job: https://bdtst1a.server:8888
    /proxy/application_1533623337748_0215/
    2018-08-07T16:35:59,521  INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: DistCp job-id: job_1533623337748_0215
    2018-08-07T16:35:59,521  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Running job: job_1533623337748_0215
    2018-08-07T16:36:04,002  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:04,002  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:06,571  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Job job_1533623337748_0215 running in uber mode : false
    2018-08-07T16:36:06,571  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job:  map 0% reduce 0%
    2018-08-07T16:36:09,004  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:09,004  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:14,006  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:14,006  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:18,603  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job:  map 100% reduce 0%
    2018-08-07T16:36:19,007  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:19,007  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:24,009  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:24,009  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:29,010  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:29,010  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:34,012  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Updating thread name to 0887b266-675a-4fb2-8c85-3a27ebb3b9fc HiveSe
    rver2-Handler-Pool: Thread-53
    2018-08-07T16:36:34,012  INFO [HiveServer2-Handler-Pool: Thread-53] session.SessionState: Resetting thread name to  HiveServer2-Handler-Pool: Thread-53
    2018-08-07T16:36:37,641  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Job job_1533623337748_0215 completed successfully
    2018-08-07T16:36:37,664  INFO [HiveServer2-Background-Pool: Thread-1549] mapreduce.Job: Counters: 33
            File System Counters
                    FILE: Number of bytes read=0
                    FILE: Number of bytes written=297867
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=129369232
                    HDFS: Number of bytes written=129368810
                    HDFS: Number of read operations=16
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=4
            Job Counters
                    Launched map tasks=1
                    Other local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=85701
                    Total time spent by all reduces in occupied slots (ms)=0
                    Total time spent by all map tasks (ms)=28567
                    Total vcore-milliseconds taken by all map tasks=28567
                    Total megabyte-milliseconds taken by all map tasks=571340000
            Map-Reduce Framework
                    Map input records=1
                    Map output records=0
                    Input split bytes=134
                    Spilled Records=0
                    Failed Shuffles=0
                    Merged Map outputs=0
                    GC time elapsed (ms)=84
                    CPU time spent (ms)=31920
                    Physical memory (bytes) snapshot=542441472
                    Virtual memory (bytes) snapshot=19353210880
                    Total committed heap usage (bytes)=826277888
            File Input Format Counters
                    Bytes Read=288
            File Output Format Counters
                    Bytes Written=0
            org.apache.hadoop.tools.mapred.CopyMapper$Counter
                    BYTESCOPIED=129368810
                    BYTESEXPECTED=129368810
                    COPY=1
    2018-08-07T16:36:37,706  INFO [HiveServer2-Background-Pool: Thread-1549] metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-
    29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
    2018-08-07T16:36:37,706  INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Source is 129368980 bytes. (MAX: 33554432)
    2018-08-07T16:36:37,706  INFO [HiveServer2-Background-Pool: Thread-1549] common.FileUtils: Launch distributed copy (distcp) job.
    2018-08-07T16:36:37,783  INFO [HiveServer2-Background-Pool: Thread-1549] tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, dele
    teMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://edhcl
    uster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000001_0], targetPath=hdfs://edhcluster/user/hive/warehouse/temp_tro/000
    001_0, targetPathExists=false, preserveRawXattrs=false}
    `
    :
    :
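
As a rough cross-check of the Beeline runtime, the two "Copying source" timestamps in the log above bracket one complete distcp job, so the total move time can be estimated from them. This is only a sketch: it assumes one output file per map task (171, taken from "Map 1: 171/171") and that every copy takes about as long as the first one. The helper names are made up for illustration.

```python
from datetime import datetime

LOG_TS = "%Y-%m-%dT%H:%M:%S,%f"  # timestamp layout used in the Hive logs


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two Hive log timestamps."""
    delta = datetime.strptime(end, LOG_TS) - datetime.strptime(start, LOG_TS)
    return delta.total_seconds()


def estimate_move_hours(per_file_seconds: float, file_count: int) -> float:
    """Estimated hours to copy file_count files, one distcp job at a time."""
    return per_file_seconds * file_count / 3600


# Copied from the Beeline log: the "Copying source" messages for
# 000000_0 and 000001_0 bracket one complete distcp job.
per_file = seconds_between("2018-08-07T16:35:58,158", "2018-08-07T16:36:37,706")

# Assumption: 171 output files, one per map task in "Map 1: 171/171".
print(f"{per_file:.1f} s per file, about {estimate_move_hours(per_file, 171):.1f} h in total")
```

Under those assumptions each file costs roughly 40 seconds, and 171 serial copies come to just under two hours, which is in line with the runtime seen in Beeline.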

Update:

I found that Hive CLI, when run as the hive user, was picking up a .hiverc file, which accounts for the difference. It sets:

hive.exec.scratchdir=/user/hive/scratch

hive.exec.stagingdir=/user/hive/staging

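The catch is that the HDFS directory /user/hive is encrypted with Ranger, while the HDFS directory /tmp/hive is unencrypted and is readable and writable by all users in the hadoop group.

I tested Beeline with the following session-level changes, and it ran as fast as Hive CLI: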

 hive.exec.scratchdir=/user/hive/scratch

 hive.exec.stagingdir=/user/hive/staging

I also tested Hive CLI with the following session-level changes, and it ran slowly, moving the data with MapReduce jobs:

 hive.exec.scratchdir=/tmp/hive/scratch

 hive.exec.stagingdir=/tmp/hive/staging

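So the root cause is that data under /user/hive is encrypted while data under /tmp/hive is not. When the staging directory is under /tmp/hive and the warehouse is under /user/hive, the final move crosses an encryption zone boundary, and each large file is copied with its own distcp MapReduce job instead of simply being moved.

The solution is to change the session-level settings so that the staging directory is in the same encryption zone as the target table.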
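
The reasoning can be sketched in Python. This is an illustration only, not Hive's actual code: the function names are hypothetical, and the list of zone roots is an assumption (here `/user/hive` is taken to be the zone root; on a real cluster the list could come from `hdfs crypto -listZones`).

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence


def encryption_zone(path: str, zone_roots: Sequence[str]) -> Optional[str]:
    """Return the zone root that contains path, or None if it is unencrypted."""
    p = PurePosixPath(path)
    for root in zone_roots:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return root
    return None


def same_zone(staging_dir: str, table_dir: str, zone_roots: Sequence[str]) -> bool:
    """True when both directories are in the same zone (or both unencrypted)."""
    return encryption_zone(staging_dir, zone_roots) == encryption_zone(table_dir, zone_roots)


# Assumption: /user/hive is the only encryption zone on the cluster.
zones = ["/user/hive"]
warehouse = "/user/hive/warehouse/temp_tro"

print(same_zone("/user/hive/staging", warehouse, zones))  # staging from .hiverc
print(same_zone("/tmp/hive/staging", warehouse, zones))   # staging under /tmp/hive
```

With the `.hiverc` value the staging directory and the warehouse share a zone, so the check passes; with a staging directory under `/tmp/hive` it fails, which corresponds to the slow distcp path.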

When the encryption zones differ, the following INFO message is logged:

  metadata.Hive: Copying source hdfs://edhcluster/tmp/hive/staging_hive_2018-08-07_16-29-12_750_8973639287951385407-1/-ext-10000/000001_0 to hdfs://edhcluster/user/hive/warehouse/temp_tro/000001_0 because HDFS encryption zones are different.
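
To check whether a given run hit this path, one option is to count these messages in the HiveServer2 log. The sketch below is a minimal example, not part of Hive; the function name is hypothetical, and the two marker strings are copied from the log output shown earlier.

```python
from typing import Dict, Iterable

# Message fragments copied from the HiveServer2 log in the question.
ZONE_MSG = "because HDFS encryption zones are different"
DISTCP_MSG = "Launch distributed copy (distcp) job"


def summarize_move_stage(lines: Iterable[str]) -> Dict[str, int]:
    """Count cross-zone copies and distcp launches in Hive log lines."""
    summary = {"cross_zone_copies": 0, "distcp_jobs": 0}
    for line in lines:
        if ZONE_MSG in line:
            summary["cross_zone_copies"] += 1
        if DISTCP_MSG in line:
            summary["distcp_jobs"] += 1
    return summary


sample = [
    "INFO metadata.Hive: Copying source ... because HDFS encryption zones are different.",
    "INFO common.FileUtils: Source is 129368810 bytes. (MAX: 33554432)",
    "INFO common.FileUtils: Launch distributed copy (distcp) job.",
]
print(summarize_move_stage(sample))
```

For a real log, pass an open file object instead of `sample`; non-zero counts suggest the staging directory and the target table are in different encryption zones.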

Thanks,

曼吉尔