COPY from Parquet S3 into Redshift and decimal vs. int types

Question

I get this error when trying to COPY data from Parquet files in S3 into Redshift:

S3 Query Exception (Fetch). Task failed due to an internal error. File
 'https://...../part-00000-xxxxx.snappy.parquet  
has an incompatible Parquet schema for column 's3://table_name/.column_name'. 
Column type: INT, Parquet schema:
optional fixed_len_byte_array COLUMN_NAME

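I suspect this is because the Parquet file has a numeric/decimal type with a greater precision than an INT column can hold, although I believe all the actual values are within the range that would fit. (The error does not specify a row number.)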

Is there a way to coerce the type conversion on COPY and take failures on an individual-row basis (as with CSV) rather than failing the whole file?

Answer 1

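I spent a day on a similar issue and found no way to coerce types on the COPY command. I was building my Parquet files with Pandas and had to match the data types to the ones in Redshift. For integers, I had Pandas int64 with Redshift BIGINT. Similarly, I had to change NUMERIC columns to DOUBLE PRECISION (Pandas float64).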
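
As a rough illustration of the dtype matching described above, here is a minimal sketch. The column names (`order_id`, `amount`), the mapping, and the output file name are hypothetical and not taken from the question; only the int64/BIGINT and float64/DOUBLE PRECISION pairing comes from my setup.

```python
import pandas as pd

# Hypothetical mapping: column name -> pandas dtype that lines up with the
# Redshift column type (int64 for BIGINT, float64 for DOUBLE PRECISION).
REDSHIFT_COMPATIBLE_DTYPES = {
    "order_id": "int64",
    "amount": "float64",
}


def align_dtypes(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Return a copy of df with the listed columns cast to the given dtypes."""
    return df.astype(dtypes)


# Made-up sample data whose columns start out as strings.
df = pd.DataFrame({"order_id": ["1", "2", "3"], "amount": ["9.99", "10", "0.5"]})
aligned = align_dtypes(df, REDSHIFT_COMPATIBLE_DTYPES)
print(aligned.dtypes)
```

After the cast, something like `aligned.to_parquet("orders.snappy.parquet", compression="snappy")` (which needs pyarrow or fastparquet installed) should write columns whose Parquet types line up with the Redshift table.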

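The file fails as a whole because the COPY command for columnar files (like Parquet) copies an entire column and then moves on to the next, so there is no way to fail individual rows. See the AWS documentation.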
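
Since COPY will not report the offending rows, one option is to check the values yourself before writing the file. The sketch below is not a Redshift feature, just a pre-load check under the assumption that the target column is a 4-byte signed INTEGER; the function name and sample values are made up for illustration.

```python
from decimal import Decimal

# Redshift INTEGER (INT4) is a signed 32-bit integer.
INT4_MIN, INT4_MAX = -2**31, 2**31 - 1


def rows_not_fitting_int(values):
    """Return (row_index, value) pairs that would not fit an INT column."""
    bad = []
    for i, v in enumerate(values):
        if v is None:
            continue  # treat NULLs as acceptable for a nullable column
        d = Decimal(v)
        # Flag values with a fractional part or outside the 32-bit range.
        if d != d.to_integral_value() or not (INT4_MIN <= d <= INT4_MAX):
            bad.append((i, v))
    return bad


# Made-up sample: the last two non-null values should be flagged.
sample = [Decimal("1"), Decimal("2147483647"), Decimal("2147483648"), Decimal("3.5"), None]
offending = rows_not_fitting_int(sample)
print(offending)
```

This gives the per-row detail that the COPY error leaves out, so the bad rows can be fixed or dropped before the Parquet file is built.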

amazon-redshift

parquet