Redshift COPY command: identity column gets alternating values due to the number of slices

Question

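I am trying to get sequentially incrementing values in a Redshift identity column while loading data with the COPY command.

"Redshift-Identity column SEED-STEP behavior with COPY command" is an excellent article, and I have been following it to work toward my goal. But even after following the last step in its list and using a manifest file, the identity column values only increment as 1, 3, 5, 7... or 2, 4, 6, 8...

When creating the table, I specified the column as: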

  bucketingid                             INT IDENTITY(1, 1) sortkey

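I understand that this behavior occurs because my single-node dc2.large cluster has 2 slices.

I am trying to load a single CSV file from S3 into Redshift.

How can I get sequentially incrementing IDs?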

Answer 1

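An IDENTITY column does not guarantee consecutive values. It only guarantees that the assigned values are unique and monotonic.

After loading the data, you can fix this with SQL: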

-- Build a copy of the table with gap-free IDs derived from the IDENTITY order
CREATE TABLE my_table_with_consecutive_ids AS
    SELECT
       ROW_NUMBER() OVER (ORDER BY bucketingid) AS consecutive_bucketingid,
       *
    FROM my_table;

Some explanation of why this happens:

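COPY performs a distributed load of your data, and each file is loaded by one node slice, so loading a single file is handled by a single slice. To guarantee unique values while different slices load data in parallel, each slice draws from an identity space exclusive to itself (with 2 slices, one uses odd numbers and the other uses even numbers).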
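To see this on your own cluster, a sketch along the following lines lists the slices and checks how the loaded identity values split between odd and even. It assumes the table is named my_table with the bucketingid column from the question, and that your user is allowed to read the STV_SLICES system table.

```sql
-- List the slices in the cluster (the question reports 2 for a
-- single-node dc2.large).
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;

-- Count rows by the parity of the identity value.
-- my_table and bucketingid are the names used in this thread.
SELECT bucketingid % 2  AS parity,
       COUNT(*)         AS row_count,
       MIN(bucketingid) AS min_id,
       MAX(bucketingid) AS max_id
FROM my_table
GROUP BY 1
ORDER BY 1;
```

If the single file was loaded by one slice, you would expect all rows to land under one parity, which matches the 1, 3, 5... or 2, 4, 6... pattern described in the question.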

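Theoretically, you could get consecutive IDs straight from the load if you split the file into two parts (or as many parts as your cluster has slices) so that every slice takes part in the load (you would need a MANIFEST file). But this is highly impractical, and it also bakes an assumption about your cluster size into the load process.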
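For illustration only, a manifest-based load for a 2-slice cluster might look roughly like this. The bucket, file names, and IAM role ARN are hypothetical placeholders, not values from the question.

```sql
-- Hypothetical manifest stored at s3://my-bucket/load/data.manifest,
-- listing one file part per slice (2 slices assumed):
--
-- {
--   "entries": [
--     {"url": "s3://my-bucket/load/part_0.csv", "mandatory": true},
--     {"url": "s3://my-bucket/load/part_1.csv", "mandatory": true}
--   ]
-- }

-- No column list is given, so the IDENTITY column is generated by Redshift.
COPY my_table
FROM 's3://my-bucket/load/data.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder role
MANIFEST
CSV;
```

Even then, nothing guarantees consecutive values or source order. The asker reports still seeing alternating IDs with a manifest, so renumbering after the load remains the more dependable route.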

The same explanation, from the CREATE TABLE manual:

IDENTITY(seed, step)

... With a COPY operation, the data is loaded in parallel and distributed to the node slices. To be sure that the identity values are unique, Amazon Redshift skips a number of values when creating the identity values. As a result, identity values are unique and sequential, but not consecutive, and the order might not match the order in the source files.
