Unable to create buckets in Hive in cloudera

I am trying to create a bucketed table in Hive on Cloudera. However, the table is created as a normal table, without any buckets.

First, using the Hive CLI, I created a normal table named marks_temp:
CREATE TABLE marks_temp(
id INT,
Name string,
mark int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

I loaded the following data, from the text file 'Desktop/Data/littlebigdata.txt', into the marks_temp table:
101,Firdaus,88
102,Pranav,78
103,Rahul,65
104,Sanjoy,65
105,Firdaus,88
106,Pranav,78
107,Rahul,65
108,Sanjoy,65
109,Amar,54
110,Sahil,34
111,Rahul,45
112,Rajnish,67
113,Ranjeet,56
114,Sanjoy,34 

I loaded the data above with the following command:

LOAD DATA LOCAL INPATH 'Desktop/Data/littlebigdata.txt'
INTO TABLE  marks_temp;

After loading the data successfully, I created a bucketed table named marks_bucketed:
CREATE TABLE marks_bucketed(
id INT,
Name string,
mark int
)
CLUSTERED BY (id) INTO 4 BUCKETS;
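For reference, this is the distribution I expect across the 4 buckets. The Python sketch below assumes Hive's default hashing for an INT column, where the hash of an int is the value itself, so the bucket index is effectively `(id & Integer.MAX_VALUE) % 4`:

```python
# Rough model of Hive's default bucketing for an INT clustering column
# (assumption: hash(int) is the int itself, bucket = (hash & MAX_INT) % numBuckets).
NUM_BUCKETS = 4

rows = [
    (101, "Firdaus", 88), (102, "Pranav", 78), (103, "Rahul", 65),
    (104, "Sanjoy", 65), (105, "Firdaus", 88), (106, "Pranav", 78),
    (107, "Rahul", 65), (108, "Sanjoy", 65), (109, "Amar", 54),
    (110, "Sahil", 34), (111, "Rahul", 45), (112, "Rajnish", 67),
    (113, "Ranjeet", 56), (114, "Sanjoy", 34),
]

def bucket_of(row_id):
    # Assumed default: the Java hash of an int is the int itself.
    return (row_id & 0x7FFFFFFF) % NUM_BUCKETS

buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_of(row[0])].append(row[0])

for b in range(NUM_BUCKETS):
    print(f"bucket {b}: ids {buckets[b]}")
# bucket 0: ids [104, 108, 112]
# bucket 1: ids [101, 105, 109, 113]
# bucket 2: ids [102, 106, 110, 114]
# bucket 3: ids [103, 107, 111]
```

So with 14 rows and 4 buckets, every bucket file should receive at least 3 rows.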

Now I am inserting the data from the marks_temp table into the marks_bucketed table.

INSERT INTO marks_bucketed
SELECT id,Name, mark FROM marks_temp;

After this, some jobs start running. What I observed in the job log is that it says "Number of reduce tasks is set to 0 since there's no reduce operator":

hive> insert into marks_bucketed
        > select id,Name,mark from marks_temp;
    Query ID = cloudera_20180601035353_29b25ffe-541e-491e-aea6-b36ede88ed79
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1527668582032_0004, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1527668582032_0004/
    Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1527668582032_0004
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2018-06-01 03:54:01,328 Stage-1 map = 0%,  reduce = 0%
    2018-06-01 03:54:14,444 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
    MapReduce Total cumulative CPU time: 2 seconds 210 msec
    Ended Job = job_1527668582032_0004
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/marks_bucketed/.hive-staging_hive_2018-06-01_03-53-45_726_2788383119636056364-1/-ext-10000
    Loading data to table default.marks_bucketed
    Table default.marks_bucketed stats: [numFiles=1, numRows=14, totalSize=194, rawDataSize=180]
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1   Cumulative CPU: 2.21 sec   HDFS Read: 3937 HDFS Write: 273 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 210 msec
    OK
    Time taken: 31.307 seconds

Also, the Hue file browser shows only one file for the table. Screenshot attached: Hue File Browser screenshot for marks_bucketed table

From the Hive documentation:

Version 0.x and 1.x only

The command set hive.enforce.bucketing = true; allows the correct number of reducers and the cluster by column to be automatically selected based on the table. Otherwise, you would need to set the number of reducers to be the same as the number of buckets as in set mapred.reduce.tasks = 256; and have a CLUSTER BY ... clause in the select.
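To see why the reducer count matters, here is a rough Python model of the two plan shapes (an illustration only, not Hive internals): a map-only insert has a single writer and so produces one output file, while CLUSTER BY with mapred.reduce.tasks = 4 routes each row to one of four reducers, each of which writes its own bucket file:

```python
# Simplified model of the two query plans (illustration, not Hive code).
NUM_REDUCERS = 4
ids = list(range(101, 115))  # the 14 sample ids, 101..114

# Plan A: "Number of reduce tasks is set to 0" -> one writer, one output file.
map_only_output = {"000000_0": ids}

# Plan B: CLUSTER BY id with 4 reducers -> reducer (id % 4) gets the row,
# and each reducer writes exactly one file.
reduce_output = {f"{r:06d}_0": [] for r in range(NUM_REDUCERS)}
for i in ids:
    reduce_output[f"{i % NUM_REDUCERS:06d}_0"].append(i)

print(len(map_only_output), "file vs", len(reduce_output), "files")
# prints: 1 file vs 4 files
```

That single file in Plan A is exactly what the Hue file browser is showing you.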

So you need to set that property to enforce bucketing, or use the manual option and run your query like:

set mapred.reduce.tasks = 4;
insert into marks_bucketed
select id, Name, mark from marks_temp cluster by id;