Hive: how to partition by a column with null values, putting all nulls in one partition

Question

I am using Hive, and my IDE is Hue. I am trying out different key combinations to choose my partition key(s).

My original table is defined as follows:

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_day` string
    )
    PARTITIONED BY ( 
      `date_year` string, 
      `date_month` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

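The code above works fine. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. However, I get the following error:

Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

When the partition keys are date_year, date_month, MSCK works fine.

The definition of the table for which I get the error is as follows: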

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string, 
    `date_month` string
  )
    PARTITIONED BY (
     `date_day` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

After this, the following query returns nothing:

Select * From `my_hive_db`.`my_table` Limit 10;

So I run the following command:

MSCK REPAIR TABLE `my_hive_db`.`my_table`;

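and I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

I checked this link, because it describes exactly the error I am getting, but when I use the solution provided there: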

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;

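I get a different error:

Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.

I think the reason I am getting these errors is that there are more than 200 million records in which date_day is null.

There are 31 distinct non-null date_day values. I would like to partition my table into 32 partitions, one for each distinct value of date_day, with all the null values going into a separate partition. Is there a way to do this (partition by a column that contains null values)?

If this can be achieved with Spark, I am open to using it as well.

This is part of a bigger problem of changing the partition keys by recreating the table, as mentioned earlier.

Thanks for your help.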

Answer 1

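You don't seem to understand how Hive's partitioning works. Hive stores data in files on HDFS (or S3, or some other distributed file system). If you create a non-partitioned Parquet table called my_schema.my_table, you will see the files stored in your distributed storage in a folder like this: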

hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...

If you create a table partitioned by a column p_col, the files will look like

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...

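The command MSCK REPAIR TABLE lets you automatically reload the partitions of a table when you create an external table.

Let's say you have folders on S3 that look like this: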

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet

You create an external table with

CREATE External Table my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_col STRING)

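The table will be created but empty, because Hive has not detected the partitions yet. You run MSCK REPAIR TABLE my_schema.my_table, and Hive recognizes that your partition column p_col matches the partitioning scheme on S3 (/p_col=value1/).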
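To make the effect of the repair concrete, here is a rough sketch of registering one of those folders by hand and then listing what the metastore knows. It assumes the table's location is the directory that contains the p_col=... folders; without a LOCATION clause, Hive looks for the partition directory under the table location.

```sql
-- Register a single partition manually; Hive expects the data under
-- <table location>/p_col=value1/ because no LOCATION is given.
ALTER TABLE my_schema.my_table ADD IF NOT EXISTS PARTITION (p_col = 'value1');

-- Show the partitions currently registered in the metastore.
SHOW PARTITIONS my_schema.my_table;
```

MSCK REPAIR TABLE does this kind of registration for every matching folder at once, which is why it only works when the folder names use the same partition column as the table definition.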

From what I could understand, you are trying to change the partitioning scheme of the table by doing

CREATE External Table my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_another_col STRING)

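and you get an error message because p_another_col does not match the column used in the S3 folder names, which is p_col. This error is perfectly normal, because what you are doing does not make sense: the folder layout on S3 is still the old one.

As already stated, you need to create a copy of the first table with a different partitioning scheme.

You should try something like this instead: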

CREATE External Table my_hive_db.my_table_2(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string, 
    `date_month` string
)
PARTITIONED BY (`date_day` string)

and then populate the new table using dynamic partitioning

INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT 
  col_id,
  result_section__col2,
  result_section_col3,
  result_section_col4,
  result_section_col5,
  result_section_col6__label,
  result_section_col7__label_id,
  result_section_text,
  result_section_unit,
  result_section_col,
  result_section_title,
  result_section_title_id,
  col13,
  `timestamp`,
  date_year,
  date_month,
  date_day
FROM my_hive_db.my_table_1
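Two practical notes on the insert above, offered as a hedged sketch rather than a definitive recipe. First, dynamic partitioning usually has to be enabled for the session; the two SET lines below are standard Hive properties, but your cluster may not allow changing them at runtime (as you saw with hive.msck.path.validation), in which case an administrator would have to set them. Second, about the nulls: by default Hive writes rows whose dynamic partition value is NULL into a single partition named __HIVE_DEFAULT_PARTITION__, so you already get your 31 + 1 partitions. If you would rather give that partition an explicit name, you can replace the NULLs yourself. The label 'unknown' is just a placeholder I made up.

```sql
-- Enable dynamic partitioning for this session (may be restricted on your cluster).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION (date_day)
SELECT
  col_id,
  result_section__col2,
  result_section_col3,
  result_section_col4,
  result_section_col5,
  result_section_col6__label,
  result_section_col7__label_id,
  result_section_text,
  result_section_unit,
  result_section_col,
  result_section_title,
  result_section_title_id,
  col13,
  `timestamp`,
  date_year,
  date_month,
  -- 'unknown' is a hypothetical label: every row with a NULL date_day
  -- lands in the single partition date_day=unknown.
  COALESCE(date_day, 'unknown') AS date_day
FROM my_hive_db.my_table_1;  -- the original table, named as in the INSERT above
```

The partition column has to come last in the SELECT list, because Hive matches dynamic partition columns by position, not by name.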
