Redshift COPY creates different compression encodings from ANALYZE
I have noticed that AWS Redshift recommends different column compression encodings from the ones that it automatically creates when loading data (via COPY) into an empty table.
For example, I created a table and loaded data from S3 as follows:
CREATE TABLE Client (Id varchar(511) , ClientId integer , CreatedOn timestamp,
UpdatedOn timestamp , DeletedOn timestamp , LockVersion integer , RegionId
varchar(511) , OfficeId varchar(511) , CountryId varchar(511) ,
FirstContactDate timestamp , DidExistPre boolean , IsActive boolean ,
StatusReason integer , CreatedById varchar(511) , IsLocked boolean ,
LockType integer , KeyWorker varchar(511) , InactiveDate timestamp ,
Current_Flag varchar(511) );
Table Client created Execution time: 0.3s
copy Client from 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>; aws_secret_access_key=<secret>'
csv fillrecord truncatecolumns ignoreheader 1 timeformat as 'YYYY-MM-DDTHH:MI:SS'
gzip acceptinvchars compupdate on region 'ap-southeast-2';
Warnings:
Load into table 'client' completed, 24284 record(s) loaded successfully.
Load into table 'client' completed, 6 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
0 rows affected
COPY executed successfully
Execution time: 3.39s
Having done this, I can view the column compression encodings that COPY applied:
select "column", type, encoding, distkey, sortkey, "notnull"
from pg_table_def where tablename = 'client';
Giving:
╔══════════════════╦═════════════════════════════╦═══════╦═══════╦═══╦═══════╗
║ id ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ clientid ║ integer ║ delta ║ false ║ 0 ║ false ║
║ createdon ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ updatedon ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ deletedon ║ timestamp without time zone ║ none ║ false ║ 0 ║ false ║
║ lockversion ║ integer ║ delta ║ false ║ 0 ║ false ║
║ regionid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ officeid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ countryid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ firstcontactdate ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ didexistprecirts ║ boolean ║ none ║ false ║ 0 ║ false ║
║ isactive ║ boolean ║ none ║ false ║ 0 ║ false ║
║ statusreason ║ integer ║ none ║ false ║ 0 ║ false ║
║ createdbyid ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ islocked ║ boolean ║ none ║ false ║ 0 ║ false ║
║ locktype ║ integer ║ lzo ║ false ║ 0 ║ false ║
║ keyworker ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
║ inactivedate ║ timestamp without time zone ║ lzo ║ false ║ 0 ║ false ║
║ current_flag ║ character varying(511) ║ lzo ║ false ║ 0 ║ false ║
╚══════════════════╩═════════════════════════════╩═══════╩═══════╩═══╩═══════╝
I can then do:
analyze compression client;
Giving (table, column, recommended encoding, estimated reduction %):
╔════════╦══════════════════╦═══════╦═══════╗
║ client ║ id ║ zstd ║ 40.59 ║
║ client ║ clientid ║ delta ║ 0.00 ║
║ client ║ createdon ║ zstd ║ 19.85 ║
║ client ║ updatedon ║ zstd ║ 12.59 ║
║ client ║ deletedon ║ raw ║ 0.00 ║
║ client ║ lockversion ║ zstd ║ 39.12 ║
║ client ║ regionid ║ zstd ║ 54.47 ║
║ client ║ officeid ║ zstd ║ 88.84 ║
║ client ║ countryid ║ zstd ║ 79.13 ║
║ client ║ firstcontactdate ║ zstd ║ 22.31 ║
║ client ║ didexistprecirts ║ raw ║ 0.00 ║
║ client ║ isactive ║ raw ║ 0.00 ║
║ client ║ statusreason ║ raw ║ 0.00 ║
║ client ║ createdbyid ║ zstd ║ 52.43 ║
║ client ║ islocked ║ raw ║ 0.00 ║
║ client ║ locktype ║ zstd ║ 63.01 ║
║ client ║ keyworker ║ zstd ║ 38.79 ║
║ client ║ inactivedate ║ zstd ║ 25.40 ║
║ client ║ current_flag ║ zstd ║ 90.51 ║
╚════════╩══════════════════╩═══════╩═══════╝
i.e. very different results.
I'm keen to know why this might be. I understand that ~24K records is fewer than the 100K that AWS specifies as needed for a meaningful compression-analysis sample, but it still seems strange that COPY and ANALYZE COMPRESSION give different results for the same 24K-row table.
COPY does not currently recommend ZSTD, which is why the recommended compression settings differ.
If you're looking to apply compression to permanent tables where you want to maximize the compression (use the least space), setting ZSTD across the board will get you close to optimal compression.
The reason RAW comes back on some columns is that there is no advantage to applying compression in this case (the number of blocks is the same compressed and uncompressed). If you know the table will grow, it can make sense to apply compression to those columns as well.
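To illustrate the advice above, one way to apply ZSTD across the board is to declare the encodings explicitly in the DDL and load with COMPUPDATE OFF, so COPY's automatic compression analysis does not override them. A minimal sketch using a few of the columns from the question (the remaining columns would follow the same pattern):

```sql
-- Declare ZSTD explicitly; with COMPUPDATE OFF, COPY keeps
-- these encodings instead of running its own analysis.
CREATE TABLE client (
    id        varchar(511) ENCODE zstd,
    clientid  integer      ENCODE zstd,
    createdon timestamp    ENCODE zstd
    -- ... remaining columns, each with ENCODE zstd ...
);

COPY client FROM 's3://<bucket-name>/<folder>/Client.csv'
credentials 'aws_access_key_id=<access key>;aws_secret_access_key=<secret>'
csv ignoreheader 1 gzip compupdate off region 'ap-southeast-2';
```

On an already-populated table, more recent Redshift releases also support changing an encoding in place, e.g. `ALTER TABLE client ALTER COLUMN id ENCODE zstd;`, which avoids a full reload.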