Is it possible to convert a Hive table format to ORC and make it bucketed

Question

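I have a set of Hive tables that are not in ORC format and are not bucketed. I want to change their format to ORC and make them bucketed. I couldn't find a concrete answer anywhere on the web. Any answer or guidance is appreciated. The Hive version is 2.3.5.

Alternatively, can this be done in Spark (PySpark or Scala)?

The simplest solution is to create a new table that is bucketed and stored as ORC, then insert into it from the old table. I am looking for an in-place solution.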

Answer 1

Create a bucketed table and load the data into it using INSERT OVERWRITE:

CREATE TABLE table_bucketed(col1 string, col2 string)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE table_bucketed
SELECT ...
  FROM table_not_bucketed;

See also Sorted Bucketed Table.
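Since the question asks for something close to an in-place change, one possible workaround is to load the bucketed copy and then swap table names, so that readers keep using the original name. The sketch below only generates the HiveQL statements as strings. The helper name `swap_to_bucketed_orc` and the `_bucketed_tmp` / `_old` suffixes are hypothetical, chosen for illustration; it assumes a non-partitioned table, and the renamed original is left in place for you to verify and drop yourself.

```python
def swap_to_bucketed_orc(table, columns, bucket_col, num_buckets):
    """Build HiveQL statements that rebuild `table` as a bucketed ORC table.

    `columns` is a list of (name, hive_type) pairs. The function only
    returns strings; running them against Hive is left to the caller.
    """
    tmp = f"{table}_bucketed_tmp"   # hypothetical temporary table name
    old = f"{table}_old"            # hypothetical name for the original table
    col_defs = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    col_names = ", ".join(name for name, _ in columns)
    return [
        # 1. Create the bucketed ORC target.
        f"CREATE TABLE {tmp} ({col_defs}) "
        f"CLUSTERED BY ({bucket_col}) INTO {num_buckets} BUCKETS STORED AS ORC",
        # 2. Copy the rows; Hive writes them into buckets on insert.
        f"INSERT OVERWRITE TABLE {tmp} SELECT {col_names} FROM {table}",
        # 3. Swap names so the bucketed table takes over the original name.
        f"ALTER TABLE {table} RENAME TO {old}",
        f"ALTER TABLE {tmp} RENAME TO {table}",
    ]


if __name__ == "__main__":
    for stmt in swap_to_bucketed_orc(
        "table_not_bucketed", [("col1", "string"), ("col2", "string")], "col1", 10
    ):
        print(stmt + ";")
```

The original table is deliberately not dropped, so the data can be checked (for example by comparing row counts) before anything is deleted.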

Answer 2

Hive: Use a staging table to read the unbucketed data (assuming TEXTFILE format) with these commands:

CREATE TABLE staging_table(
    col1 colType, 
    col2 colType, ...
    coln colType
)
STORED AS 
    TEXTFILE
LOCATION 
    '/path/of/input/data';

CREATE TABLE target_table(
    col1 colType, 
    col2 colType, ...
    coln colType
)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;

INSERT OVERWRITE TABLE target_table
SELECT 
    col1, col2, ..., coln
FROM 
    staging_table;

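The format conversion can also be done with the **Spark** DataFrame API (assuming CSV format), as follows. Note that this snippet only writes the data as ORC; it does not bucket it.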

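# Assumes an existing SparkSession named `spark` (PySpark).
# Read the CSV input, inferring column types from the data.
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .option("path", "/path/of/input/data/")
      .load())

# Write the same rows back out as ORC files.
(df.write.format("orc")
   .option("path", "/path/of/output/data/")
   .save())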


hive

acid

orc