Load struct or any other complex data type in Hive

I have an .xlsx file containing data similar to the image below, and I am trying to create a table for it with the following CREATE query:

CREATE TABLE aus_aboriginal(
    code int,
    area_name string,
    male_0_4 STRUCT<num:double, total:double, perc:double>,
    male_5_9 STRUCT<num:double, total:double, perc:double>,
    male_10_14 STRUCT<num:double, total:double, perc:double>,
    male_15_19 STRUCT<num:double, total:double, perc:double>,
    male_20_24 STRUCT<num:double, total:double, perc:double>,
    male_25_29 STRUCT<num:double, total:double, perc:double>,
    male_30_34 STRUCT<num:double, total:double, perc:double>,
    male_35_39 STRUCT<num:double, total:double, perc:double>,
    male_40_44 STRUCT<num:double, total:double, perc:double>,
    male_45_49 STRUCT<num:double, total:double, perc:double>,
    male_50_54 STRUCT<num:double, total:double, perc:double>,
    male_55_59 STRUCT<num:double, total:double, perc:double>,
    male_60_64 STRUCT<num:double, total:double, perc:double>,
    male_above_65 STRUCT<num:double, total:double, perc:double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

When I load the data into it, I get NULLs. What am I missing in the CREATE TABLE statement?

You also need to specify a delimiter for the struct members in the CREATE statement, as follows:

CREATE TABLE aus_aboriginal( code INT, area_name STRING, 
male_0_4 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_5_9 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_10_14 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_15_19 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_20_24 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_25_29 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_30_34 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>,
male_35_39 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_40_44 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_45_49 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_50_54 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_55_59 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_60_64 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>, 
male_above_65 STRUCT<num:DOUBLE, total:DOUBLE, perc:DOUBLE>) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
COLLECTION ITEMS TERMINATED BY ':';

You can then run a sample query such as:

SELECT code, male_0_4.num, male_0_4.total, male_0_4.perc FROM aus_aboriginal;

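When using complex types such as struct, it is recommended to use a delimiter for collection items that differs from the one used for fields (columns). Consider a CSV file in the following format, which uses the comma (',') as its delimiter.

Input.csv: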

Code,area_name,num,total,perc,num,total,perc,num,total,perc

1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8

1111,armid,40,404,14.4,97,701,13.8,76,701,10.8

The expected result is one complex type built from each group of fields (num, total, and perc):

1100,Albury,struct<90,444,17.4>,struct<73,546,13.4>,struct<86,546,15.8>

1111,armid, struct<40,404,14.4>, struct<97,701,13.8>,struct<76,701,10.8>

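In this case, if we try to build the complex type from the fields (num, total, and perc) with the Hive query below, the table ends up with many NULL values. The same ',' delimiter is used for both fields and collection items, so Hive cannot separate the data the way we want.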

Hive> create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>) ROW FORMAT DELIMITED FIELDS TERMINATED BY  ',' COLLECTION ITEMS TERMINATED BY ',' LOCATION '/csv';

Output:

1100 Albury {"num":90.0,"total":null,"perc":null} {"num":444.0,"total":null,"perc":null} {"num":17.4,"total":null,"perc":null}

1111 armid {"num":40.0,"total":null,"perc":null} {"num":404.0,"total":null,"perc":null} {"num":14.4,"total":null,"perc":null}

Time taken: 0.15 seconds, Fetched: 2 row(s)

I suspect this is the problem you are running into.
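The NULL pattern above can be reproduced with a rough model of delimited-text parsing. The sketch below is only an illustration in Python, not Hive's actual SerDe code; the function name `parse_row` and its parameters are hypothetical. It splits a line into fields first, then splits each struct field into members, and pads any missing member with `None`.

```python
def parse_row(line, field_delim, item_delim, n_structs=3):
    """Simplified model: split into fields, then split each struct field into members."""
    fields = line.split(field_delim)
    code, area_name = int(fields[0]), fields[1]
    structs = []
    for raw in fields[2:2 + n_structs]:
        # Keep at most three members and pad the missing ones with None (NULL).
        values = [float(p) for p in raw.split(item_delim)[:3]]
        values += [None] * (3 - len(values))
        structs.append(dict(zip(("num", "total", "perc"), values)))
    return code, area_name, structs


# Same delimiter for fields and collection items: each struct only gets "num".
same = parse_row("1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8", ",", ",")
print(same[2][0])  # {'num': 90.0, 'total': None, 'perc': None}

# Distinct collection delimiter: every member is populated.
distinct = parse_row("1100,Albury,90#444#17.4,73#546#13.4,86#546#15.8", ",", "#")
print(distinct[2][0])  # {'num': 90.0, 'total': 444.0, 'perc': 17.4}
```

Because the field split consumes every comma, nothing is left for the member split to work on, which is why only the first member of each struct is filled.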

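Using struct

Now consider an input file with data in the following format, where ',' is the delimiter for fields and '#' is the delimiter for collection items.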

1100,Albury,90#444#17.4,73#546#13.4,86#546#15.8

1111,armid,40#404#14.4,97#701#13.8,76#701#10.8

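In this case, we can successfully create the table with complex types by specifying '#' as the delimiter for collection items and ',' as the delimiter for fields. See the Hive query below.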

hive> create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>) ROW FORMAT DELIMITED FIELDS TERMINATED BY  ',' COLLECTION ITEMS TERMINATED BY '#' LOCATION '/csv';

Output:

hive> select * from aus_aboriginal;

1100 Albury {"num":90.0,"total":444.0,"perc":17.4} {"num":73.0,"total":546.0,"perc":13.4} {"num":86.0,"total":546.0,"perc":15.8}

1111 armid {"num":40.0,"total":404.0,"perc":14.4} {"num":97.0,"total":701.0,"perc":13.8} {"num":76.0,"total":701.0,"perc":10.8}

Time taken: 0.146 seconds, Fetched: 2 row(s)
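If the source file is still flat, with one column per num/total/perc value, it has to be regrouped into the '#'-delimited layout before loading. One possible preprocessing step is sketched below in Python. It assumes the spreadsheet has already been exported to CSV and the header line removed; the function name `regroup_rows` and its parameters are made up for this illustration.

```python
import csv
import io


def regroup_rows(flat_csv, key_cols=2, group_size=3, item_delim="#"):
    """Join every `group_size` value columns with `item_delim`, leaving key columns alone."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(flat_csv)):
        if not row:
            continue  # skip blank lines
        keys, values = row[:key_cols], row[key_cols:]
        groups = [
            item_delim.join(values[i:i + group_size])
            for i in range(0, len(values), group_size)
        ]
        writer.writerow(keys + groups)
    return out.getvalue()


flat = (
    "1100,Albury,90,444,17.4,73,546,13.4,86,546,15.8\n"
    "1111,armid,40,404,14.4,97,701,13.8,76,701,10.8\n"
)
print(regroup_rows(flat), end="")
```

The returned text can be written to a file and placed in the table's location. The delimiter passed as `item_delim` must match the one given in COLLECTION ITEMS TERMINATED BY.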

A similar approach applies to the other complex types; see the link below for more information.

Reference: http://edu-kinect.com/blog/2014/06/16/hive-complex-data-types-with-examples/

Create the Hive table using:

CREATE TABLE `complex_data_types`(
  `col1` array<string>, 
  `col2` map<int,string>, 
  `col3` struct<c1:smallint,c2:varchar(30)>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '&' 
  MAP KEYS TERMINATED BY '#';

Note: a union type can be handled in the same way.

Create a CSV file:

arr1&arr2,101#map1&102#map2,11&varchar_1
arr3&arr4,103#map3&104#map4,12&varchar_2

Load this data into the Hive table:

LOAD DATA LOCAL INPATH '/home/dev/complex_data.csv' into table complex_data_types;

Note: this assumes the file is located at /home/dev/complex_data.csv.
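To see how the three delimiters carve up one line of the sample file, here is a small Python illustration. It is a simplified model rather than Hive's own parser, and the function name `parse_complex` is hypothetical. ',' separates the columns, '&' separates array elements, map entries, and struct members, and '#' separates each map key from its value.

```python
def parse_complex(line):
    """Split one sample line into (array, map, struct) following the table's delimiters."""
    col1_raw, col2_raw, col3_raw = line.split(",")
    # col1: array<string>, elements separated by '&'
    col1 = col1_raw.split("&")
    # col2: map<int,string>, entries separated by '&', key and value by '#'
    col2 = {}
    for entry in col2_raw.split("&"):
        key, value = entry.split("#")
        col2[int(key)] = value
    # col3: struct<c1:smallint,c2:varchar(30)>, members separated by '&'
    c1, c2 = col3_raw.split("&")
    col3 = {"c1": int(c1), "c2": c2}
    return col1, col2, col3


print(parse_complex("arr1&arr2,101#map1&102#map2,11&varchar_1"))
# (['arr1', 'arr2'], {101: 'map1', 102: 'map2'}, {'c1': 11, 'c2': 'varchar_1'})
```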