Redshift COPY creates different compression encodings from ANALYZE
I have noticed that AWS Redshift recommends different column compression encodings from the ones that it automatically creates when loading data (via COPY) into an empty table.
For example, I created a table and loaded data from S3 as follows:
CREATE TABLE Client (Id varchar(511) , ClientId integer , CreatedOn timestamp,
UpdatedOn timestamp , DeletedOn timestamp , LockVersion integer , RegionId
varchar(511) , OfficeId varchar(511) , CountryId varchar(511) ,
FirstContactDate timestamp , DidExistPre boolean , IsActive boolean ,
StatusReason integer , CreatedById varchar(511) , IsLocked boolean ,
LockType integer , KeyWorker varchar(511) , InactiveDate timestamp ,
Current_Flag varchar(511) );
Table Client created Execution time: 0.3s
copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1 timeformat as 'YYYY-MM-DDTHH:MI:SS'
gzip acceptinvchars compupdate on region 'ap-southeast-2';
Warnings:
Load into table 'client' completed, 24284 record(s) loaded successfully.
Load into table 'client' completed, 6 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
0 rows affected
COPY executed successfully
Execution time: 3.39s
Having done this, I can view the column compression encodings that COPY applied:
select "column", type, encoding, distkey, sortkey, "notnull"
from pg_table_def where tablename = 'client';
Giving:
╔══════════════════╦═════════════════════════════╦═══════╦═══════╦═══╦═══════╗
║ id ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ clientid ║ integer ║ delta ║ false ║ 0 ║ false ║
║ createdon ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ updatedon ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ deletedon ║ timestamp without time zone ║ none ║ false ║ 0 ║ false ║
║ lockversion ║ integer ║ delta ║ false ║ 0 ║ false ║
║ regionid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ officeid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ countryid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ firstcontactdate ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ didexistprecirts ║ boolean ║ none ║ false ║ 0 ║ false ║
║ isactive ║ boolean ║ none ║ false ║ 0 ║ false ║
║ statusreason ║ integer ║ none ║ false ║ 0 ║ false ║
║ createdbyid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ islocked ║ boolean ║ none ║ false ║ 0 ║ false ║
║ locktype ║ integer ║ lzo ║ false ║ 0 ║ false ║
║ keyworker ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ inactivedate ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ current_flag ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
╚══════════════════╩═════════════════════════════╩═══════╩═══════╩═══╩═══════╝
I can then do:
analyze compression client;
Giving (table, column, recommended encoding, estimated reduction %):
╔════════╦══════════════════╦═══════╦═══════╗
║ client ║ id ║ zstd ║ 40.59 ║
║ client ║ clientid ║ delta ║ 0.00 ║
║ client ║ createdon ║ zstd ║ 19.85 ║
║ client ║ updatedon ║ zstd ║ 12.59 ║
║ client ║ deletedon ║ raw ║ 0.00 ║
║ client ║ lockversion ║ zstd ║ 39.12 ║
║ client ║ regionid ║ zstd ║ 54.47 ║
║ client ║ officeid ║ zstd ║ 88.84 ║
║ client ║ countryid ║ zstd ║ 79.13 ║
║ client ║ firstcontactdate ║ zstd ║ 22.31 ║
║ client ║ didexistprecirts ║ raw ║ 0.00 ║
║ client ║ isactive ║ raw ║ 0.00 ║
║ client ║ statusreason ║ raw ║ 0.00 ║
║ client ║ createdbyid ║ zstd ║ 52.43 ║
║ client ║ islocked ║ raw ║ 0.00 ║
║ client ║ locktype ║ zstd ║ 63.01 ║
║ client ║ keyworker ║ zstd ║ 38.79 ║
║ client ║ inactivedate ║ zstd ║ 25.40 ║
║ client ║ current_flag ║ zstd ║ 90.51 ║
╚════════╩══════════════════╩═══════╩═══════╝
i.e. very different results.
I'm keen to know why this might be. I understand that ~24K records is fewer than the 100K that AWS specifies as needed for a meaningful compression-analysis sample, but it still seems strange that COPY and ANALYZE COMPRESSION give different results for the same 24K-row table.
COPY does not currently recommend ZSTD, which is why the recommended compression settings differ.
If you're looking to apply compression to permanent tables where you want to maximize the compression (use the least space), setting ZSTD across the board will get you close to optimal compression.
The reason RAW comes back on some columns is that there is no advantage to applying compression in this case (the number of blocks is the same compressed and uncompressed). If you know the table will grow, it can make sense to apply compression to those columns as well.
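To illustrate the advice above, one way to apply ZSTD across the board is to declare the encodings explicitly in the DDL and load with COMPUPDATE OFF, so COPY's automatic compression analysis does not override them. A minimal sketch using a few of the columns from the question (the remaining columns would follow the same pattern):

```sql
-- Declare ZSTD explicitly; with COMPUPDATE OFF, COPY keeps
-- these encodings instead of running its own analysis.
CREATE TABLE client (
    id        varchar(511) ENCODE zstd,
    clientid  integer      ENCODE zstd,
    createdon timestamp    ENCODE zstd
    -- ... remaining columns, each with ENCODE zstd ...
);

COPY client FROM 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>;aws_secret_access_key=<secret>'
csv ignoreheader 1 gzip compupdate off region 'ap-southeast-2';
```

On an already-populated table, more recent Redshift releases also support changing an encoding in place, e.g. `ALTER TABLE client ALTER COLUMN id ENCODE zstd;`, which avoids a full reload.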