Athena - Union tables with incompatible data types

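We have two tables in which one column has a different data type. In the first table the column is an int, while the same column in the second table is a float/real. If it were a top-level column, I could simply CAST it to a common type; the problem is that these columns sit deep inside a struct.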

The error I get is:

SYNTAX_ERROR: line 23:1: column 4 in UNION query has incompatible types: row(priceconfiguration row(maximumvalue integer, minimumvalue integer, type varchar, value integer)), row(priceconfiguration row(maximumvalue integer, minimumvalue integer, type varchar, value real))

The (simplified) query is:

 WITH t1 AS (
   SELECT
     "so"."createdon"
   , "so"."modifiedon"
   , "so"."deletedon"
   , "so"."createdby"
   , "so"."priceconfiguration"
   , "so"."year"
   , "so"."month"
   , "so"."day"
   FROM
     my_db.raw_price so
   UNION ALL
   SELECT
     "ao"."createdon"
   , "ao"."modifiedon"
   , "ao"."deletedon"
   , "ao"."createdby"
   , "ao"."priceconfiguration"
   , "ao"."year"
   , "ao"."month"
   , "ao"."day"
   FROM
     my_db.src_price ao
) 
SELECT t1.* FROM t1 ORDER BY "modifiedon" DESC

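In reality, the actual tables are more complex than this, and the priceconfiguration column is nested deep inside the table. Casting the problematic column directly is therefore not possible unless every struct is unnested down to the column that needs the CAST.

Is there a way to UNION these two tables without unnesting and casting?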

The solution is to upgrade the Athena engine version to v2.

The v2 engine has better support for schema evolution. According to the AWS documentation:

Schema evolution support has been added for data in Parquet format.

Added support for reading array, map, or row type columns from partitions where the partition schema is different from the table schema. This can occur when the table schema was updated after the partition was created. The changed column types must be compatible. For row types, trailing fields may be added or dropped, but the corresponding fields (by ordinal) must have the same name.

Reference: https://docs.aws.amazon.com/athena/latest/ug/engine-versions-reference.html
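
If upgrading the engine is not an option, it may be worth trying to cast the whole struct to a common row type in one step instead of unnesting it. The following is only a sketch under assumptions: the field list is copied from the error message in the question, raw_price is assumed to be the side that stores value as integer, and the column list is shortened for readability. Whether a row-to-row CAST with named fields is accepted can depend on the engine version, so treat it as something to test rather than a confirmed fix.

```sql
-- Sketch: align the struct types by casting the entire row at once.
-- Assumption: raw_price has "value" as integer, src_price has it as real
-- (field names and types are taken from the error message in the question).
WITH t1 AS (
  SELECT
    so.createdon
  , so.modifiedon
  , CAST(so.priceconfiguration AS ROW(
      maximumvalue integer,
      minimumvalue integer,
      type varchar,
      value real
    )) AS priceconfiguration
  FROM
    my_db.raw_price so
  UNION ALL
  SELECT
    ao.createdon
  , ao.modifiedon
  , ao.priceconfiguration
  FROM
    my_db.src_price ao
)
SELECT t1.* FROM t1 ORDER BY modifiedon DESC
```

For a struct that is nested more deeply, the same idea would mean spelling out the full row type of the outermost column, with only the mismatched leaf field changed to real.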