Update column in Amazon Redshift with join for big tables

Question

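I have a table with 500M rows and 30 columns (including a bigint ID column); let's call it big_one. I also have another table, extra_one, with the same number of rows and the same ID column, plus two new columns holding extra data that I want to include in the first table. I added two extra columns to the first table and want to populate them based on a join.

The query is simple: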

update big_one set
    col1=extra_one.col1,
    col2=extra_one.col2
from extra_one
where big_one.id=extra_one.id;

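But during execution, disk space usage increased dramatically, up to 100%. Before I started, 23.41% of the disk space on 4 nodes was in use (160GB each, 640GB total). The big_one table initially used about 18% of the space. That 23.41% means I had roughly 490GB of free disk space to run the update smoothly. But Redshift thinks differently.

The two new columns are md5 hashes (so they are 32 characters long); ideally they should take up about 16GB of space.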

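Recap:

I have a wide table, big_one.
There is another table, extra_one (3 columns in total), with the same IDs and the same number of records.
I added two new columns to big_one.
I want to enrich big_one with the data from extra_one (into those 2 new columns).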

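Q1: Any suggestions on how to perform such a big update?

Q2: If I create a VIEW that joins the two tables and use that instead, will it save me from this kind of disk space exhaustion? How does Redshift handle a (non-materialized) VIEW in this case?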

Answer 1

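Do not use UPDATE on a large number of rows.

When a row is modified in Amazon Redshift, the existing row is marked as deleted and a new row is appended to the table. Updating every row therefore effectively doubles the size of the table and wastes a lot of disk space until the table is vacuumed. It is also slow!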
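
To observe this on a cluster, the table's footprint can be checked before and after the UPDATE, and the space held by deleted rows reclaimed afterwards. This is a minimal sketch, assuming the table name big_one from the question; svv_table_info reports size in 1 MB blocks.

```sql
-- Footprint of the table. size is in 1 MB blocks;
-- tbl_rows also counts rows marked for deletion but not yet vacuumed.
SELECT "table", size, tbl_rows
FROM svv_table_info
WHERE "table" = 'big_one';

-- Reclaim the space held by rows marked as deleted,
-- then refresh the planner statistics.
VACUUM big_one;
ANALYZE big_one;
```

VACUUM may itself need free working space, so it is a cleanup step after the fact rather than a way to make a full-table UPDATE fit.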

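Instead:

Create a query that JOINs the two tables
Use the query to populate a new table (see below)
Drop the old table and rename the new table to replace the original (or, truncate the original table and copy the data back into it)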

You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.

From CREATE TABLE - Amazon Redshift:

LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.

Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
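
Put together, the rebuild might look like the sketch below. It assumes the table and column names from the question (big_one, extra_one, id, col1, col2); c1 and c2 are hypothetical stand-ins for the 29 other original columns, and big_one_new and big_one_old are made-up names for the working copies.

```sql
-- 1. Empty copy of big_one, inheriting its column definitions,
--    distribution style and sort keys. It already includes the
--    two new (still empty) columns col1 and col2.
CREATE TABLE big_one_new (LIKE big_one);

-- 2. Populate the copy from a join instead of updating in place.
--    c1 and c2 are hypothetical: list every original column here.
--    LEFT JOIN keeps rows of big_one that have no match in extra_one,
--    which mirrors what the UPDATE would have left untouched.
INSERT INTO big_one_new (id, c1, c2, col1, col2)
SELECT b.id, b.c1, b.c2, e.col1, e.col2
FROM big_one b
LEFT JOIN extra_one e ON b.id = e.id;

-- 3. Swap the tables, then drop the old one once the result is verified.
ALTER TABLE big_one RENAME TO big_one_old;
ALTER TABLE big_one_new RENAME TO big_one;
DROP TABLE big_one_old;
```

The new table still needs room for a full second copy of the data until the old table is dropped, but no deleted rows are left behind, so no VACUUM is required afterwards.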
