Load struct or any other complex data type in Hive
I have an .xlsx file containing data similar to the image below, and I am trying to create a table for it with the CREATE statement below:
CREATE TABLE aus_aboriginal(
code int,
area_name string,
male_0_4 STRUCT<num:double, total:double, perc:double>,
male_5_9 STRUCT<num:double, total:double, perc:double>,
male_10_14 STRUCT<num:double, total:double, perc:double>,
male_15_19 STRUCT<num:double, total:double, perc:double>,
male_20_24 STRUCT<num:double, total:double, perc:double>,
male_25_29 STRUCT<num:double, total:double, perc:double>,
male_30_34 STRUCT<num:double, total:double, perc:double>,
male_35_39 STRUCT<num:double, total:double, perc:double>,
male_40_44 STRUCT<num:double, total:double, perc:double>,
male_45_49 STRUCT<num:double, total:double, perc:double>,
male_50_54 STRUCT<num:double, total:double, perc:double>,
male_55_59 STRUCT<num:double, total:double, perc:double>,
male_60_64 STRUCT<num:double, total:double, perc:double>,
male_above_65 STRUCT<num:double, total:double, perc:double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
When I load the data into it, I get NULLs.
What am I missing in the CREATE TABLE statement?
You also need to specify a delimiter for the struct members in the CREATE statement, as shown below:
CREATE TABLE aus_aboriginal( code INT, area_name STRING,
male_0_4 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_5_9 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_10_14 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_15_19 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_20_24 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_25_29 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_30_34 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_35_39 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_40_44 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_45_49 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_50_54 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_55_59 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_60_64 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_above_65 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':';
You can then run a sample query such as:
SELECT code, male_0_4.num, male_0_4.total, male_0_4.perc FROM aus_aboriginal;
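When using complex types such as structs, use a delimiter for collection items that is different from the one used for fields (columns).
Consider a CSV file in the following format, where a comma is the only delimiter.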
Input.csv
Code, area_name,num,total,perc,num,total,perc,num,total,perc
1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8
1111,armid,40,404,14.4,97,701,13.8,76,701,10.8
The expected result is a complex type built from the fields num, total and perc:
1100,Albury,struct<90,444,17.4>,struct<73,546,13.4>,struct<86,546,15.8>
1111,armid, struct<40,404,14.4>, struct<97,701,13.8>,struct<76,701,10.8>
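In this case, if we try to create the complex type from the fields (num, total and perc) with the Hive query below, we get many NULL values in the table: the same comma delimiter is used for both fields and collection items, so Hive cannot split the data the way we want.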
Hive> create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ',' LOCATION '/csv';
Output:
1100 Albury {"num":90.0,"total":null,"perc":null} {"num":444.0,"total":null,"perc":null} {"num":17.4,"total":null,"perc":null}
1111 armid {"num":40.0,"total":null,"perc":null} {"num":404.0,"total":null,"perc":null} {"num":14.4,"total":null,"perc":null}
Time taken: 0.15 seconds, Fetched: 2 row(s)
I suspect this is the problem you are running into.
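To make the cause of the NULLs concrete, here is a minimal Python sketch of the splitting order that the output above implies: the row is split on the field delimiter first, and only then is each struct column split on the collection item delimiter. This is a simplified model for illustration, not Hive's actual SerDe code; the function name parse_row and its parameters are hypothetical.

```python
# Simplified model of delimited-text parsing: split the row on the field
# delimiter first, then split each struct column on the collection delimiter.
# Illustration only; this is not Hive's real implementation.

STRUCT_KEYS = ("num", "total", "perc")


def parse_row(line, field_delim, item_delim, struct_cols=3):
    fields = line.split(field_delim)
    code, area_name = int(fields[0]), fields[1]
    structs = []
    # Only the first `struct_cols` fields after the two scalars are used;
    # any further fields on the line are dropped, matching the output above.
    for raw in fields[2:2 + struct_cols]:
        items = raw.split(item_delim)
        # Pad with None so that missing struct members show up as NULLs.
        values = [float(v) for v in items] + [None] * len(STRUCT_KEYS)
        structs.append(dict(zip(STRUCT_KEYS, values)))
    return code, area_name, structs


# Same delimiter for fields and collection items: each struct gets one value.
same = parse_row("1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8", ",", ",")
# Distinct delimiters: each struct receives all three members.
distinct = parse_row("1100,Albury,90#444#17.4,73#546#13.4,86#546#15.8", ",", "#")

print(same[2][0])      # {'num': 90.0, 'total': None, 'perc': None}
print(distinct[2][0])  # {'num': 90.0, 'total': 444.0, 'perc': 17.4}
```

With a shared delimiter, the field split has already consumed every comma, so there is nothing left for the collection split to find inside a column.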
Using STRUCT
Now consider an input file with data in the following format, where a comma separates the fields and "#" separates the collection items.
1100,Albury,90#444#17.4,73#546#13.4,86#546#15.8
1111,armid,40#404#14.4,97#701#13.8,76#701#10.8
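In this case, specifying # as the delimiter for collection items and , as the delimiter for fields lets us create the table with the complex type successfully. See the Hive query below.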
hive> create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '#' LOCATION '/csv';
Output:
hive> select * from aus_aboriginal;
1100 Albury {"num":90.0,"total":444.0,"perc":17.4} {"num":73.0,"total":546.0,"perc":13.4} {"num":86.0,"total":546.0,"perc":15.8}
1111 armid {"num":40.0,"total":404.0,"perc":14.4} {"num":97.0,"total":701.0,"perc":13.8} {"num":76.0,"total":701.0,"perc":10.8}
Time taken: 0.146 seconds, Fetched: 2 row(s)
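If the data only exists in the flat, all-comma layout of Input.csv, it has to be rewritten into the "#" layout before loading. The following is one possible preprocessing sketch in Python, assuming the file has a header line, two leading scalar columns and complete groups of three values per struct. The function name flatten_to_struct_rows is hypothetical, and an .xlsx file would first need to be exported to CSV.

```python
import csv
import io


def flatten_to_struct_rows(flat_csv_text, scalar_cols=2, struct_width=3, item_delim="#"):
    """Join every `struct_width` value columns with `item_delim`.

    Assumes the first line is a header and that each data row has
    `scalar_cols` scalar columns followed by complete groups of values.
    """
    reader = csv.reader(io.StringIO(flat_csv_text))
    next(reader)  # skip the header line so it is not loaded as a data row
    out_lines = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        scalars, values = row[:scalar_cols], row[scalar_cols:]
        groups = [
            item_delim.join(values[i:i + struct_width])
            for i in range(0, len(values), struct_width)
        ]
        out_lines.append(",".join(scalars + groups))
    return "\n".join(out_lines) + "\n"


flat = """Code, area_name,num,total,perc,num,total,perc,num,total,perc
1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8
1111,armid,40,404,14.4,97,701,13.8,76,701,10.8
"""
print(flatten_to_struct_rows(flat), end="")
```

For the two sample rows this prints exactly the "#"-delimited lines used as input above. The header is dropped because the table definition does not describe a header row.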
A similar approach should be used for the other complex types; see the link below for more information.
Reference:
http://edu-kinect.com/blog/2014/06/16/hive-complex-data-types-with-examples/
Create the Hive table using:
CREATE TABLE `complex_data_types`(
`col1` array<string>,
`col2` map<int,string>,
`col3` struct<c1:smallint,c2:varchar(30)>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '&'
MAP KEYS TERMINATED BY '#';
Note: a union type can be handled in the same way.
Create a CSV file:
arr1&arr2,101#map1&102#map2,11&varchar_1
arr3&arr4,103#map3&104#map4,12&varchar_2
Load this data into the Hive table:
LOAD DATA LOCAL INPATH '/home/dev/complex_data.csv' into table complex_data_types;
Note: this assumes the file is located at /home/dev/complex_data.csv.
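To see how the three delimiters in complex_data_types map onto one row of the sample file, here is a small Python sketch that splits a line the same way: "," between columns, "&" between collection items, and "#" between a map key and its value. It is only a model of the layout, not Hive code, and parse_complex_row is a hypothetical name.

```python
def parse_complex_row(line, field_delim=",", item_delim="&", key_delim="#"):
    """Split one text row into (array, map, struct) using three delimiters."""
    col1_raw, col2_raw, col3_raw = line.split(field_delim)

    # col1 array<string>: items separated by the collection delimiter.
    col1 = col1_raw.split(item_delim)

    # col2 map<int,string>: entries separated by the collection delimiter,
    # key and value separated by the map key delimiter.
    col2 = {}
    for entry in col2_raw.split(item_delim):
        key, value = entry.split(key_delim, 1)
        col2[int(key)] = value

    # col3 struct<c1:smallint,c2:varchar(30)>: members are positional and
    # separated by the collection delimiter.
    c1, c2 = col3_raw.split(item_delim)
    col3 = {"c1": int(c1), "c2": c2}

    return col1, col2, col3


print(parse_complex_row("arr1&arr2,101#map1&102#map2,11&varchar_1"))
# (['arr1', 'arr2'], {101: 'map1', 102: 'map2'}, {'c1': 11, 'c2': 'varchar_1'})
```

Struct members and array items share the collection delimiter, which is why the sample file uses "&" in both col1 and col3.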