Copy from S3 to Redshift with a non-standard delimiter

Question

This one is giving me a headache. I have a simple table in Redshift:

create table data.texttest(
    col1 int null,
    col2 int null,
    col3 varchar(256) null,
    col4 int null,
    col5 int null
);

Below is the content of the gzip file. The line separator is LF (no CR).

col-1þcol-2þcol-3þcol-4þcol5
1268437þ1268437þSome Textþ0þ
1268437þ1268443þSome Textþ0þ
1268437þ1881096þSome Textþ0þ
1268437þ1881109þSome Textþ0þ
1268437þ1881114þSome Textþ0þ
1268437þ1881115þSome Textþ0þ
1268437þ1881129þSome Textþ0þ
1268437þ2807685þSome Textþ0þ
2931841þ2931841þSome Textþ0þ
1268437þ3368478þSome Textþ0þ
1268437þ4339135þSome Textþ0þ
1268437þ4357980þSome Textþ0þ
1268437þ4483058þSome Textþ0þ

The load is simple...

copy data.texttest (col1,col2,col3,col4,col5) from 's3://<bucket>/<file_name>.log.gz' with credentials 'aws_access_key_id=<>;aws_secret_access_key=<>' delimiter '\376' gzip ignoreheader 1;

But alas... no. I keep getting the following Redshift error on col1:

1214 | Delimiter not found

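When I manually replace the þ (small thorn, '\376') with a comma, Redshift is happy. Obviously I can't change it in the actual process. Am I missing something here?

Any help is appreciated.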

Answer 1

We had the same problem, and we solved it by replacing the þ character with a tab.

If you are loading the data from Linux, you can use the following command:

sed -i 's/þ/\t/g' /your/file/path/file_name.extension

The -i option makes the command overwrite the original file. To keep the original file, use:

sed -e 's/þ/\t/g' /your/file/path/file_name.extension > newfile_name.extension
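The file in the question is gzipped, so the replacement has to happen on the decompressed bytes. A minimal Python sketch of the same idea, streaming through gzip, might look like the following. It assumes the delimiter is the single byte 0xFE (octal 376, which is þ in Latin-1) and that the data itself contains no tabs. The function name and the paths are hypothetical. If the file is UTF-8 encoded instead, þ is the two bytes 0xC3 0xBE and the `old` argument would have to change accordingly.

```python
import gzip


def replace_delimiter(src_path, dst_path, old=b"\xfe", new=b"\t"):
    """Copy a gzip file line by line, swapping the delimiter bytes.

    Works on raw bytes so that no text encoding has to be guessed.
    """
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.replace(old, new))


# Hypothetical usage:
# replace_delimiter("file_name.log.gz", "file_name.tab.log.gz")
```

The rewritten file could then be loaded with delimiter '\t' in the COPY command.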

Answer 2

The delimiter must be ASCII (COPY documentation):

Single ASCII character that is used to separate fields in the input file
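To see why þ does not meet that requirement, here is a quick check in Python. This is only an illustration of the character's code point and byte encodings; which encoding the actual file uses is not stated in the question.

```python
thorn = "\u00fe"  # the small thorn character used as the delimiter

# ASCII covers code points 0-127; the thorn is code point 254 (octal 376).
code_point = ord(thorn)
octal_form = oct(code_point)
is_ascii = thorn.isascii()

# One byte in Latin-1, two bytes in UTF-8.
latin1_bytes = thorn.encode("latin-1")
utf8_bytes = thorn.encode("utf-8")

print(code_point, octal_form, is_ascii, latin1_bytes, utf8_bytes)
```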

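So some preprocessing is required. The sed answer is a good one. It could also be combined with an SSH ingest (rather than a plain copy), so that you modify the characters in the stream on the way in instead of rewriting the data first.

A second approach, which is a bit of a hack, is to COPY the file into a staging table with a single text column, then process it into the target table using SPLIT_PART, with either a create target as select split_part(... or an insert into... select split_part(... style query. SPLIT_PART does not require its delimiter to be ASCII, at least not according to the documentation.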
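The staging approach could be sketched as follows. This small Python helper only builds the SQL text and does not connect to Redshift. The staging table name data.texttest_staging and its single column line are hypothetical, and the nullif(..., '') wrapper is my own assumption for turning the empty trailing field into NULL before the cast to int.

```python
def build_split_insert(target, staging, columns, delimiter, line_col="line"):
    """Build an INSERT ... SELECT that splits one text column with SPLIT_PART.

    columns is a list of (name, sql_type) pairs in file order.
    """
    quoted = delimiter.replace("'", "''")  # escape quotes for the SQL literal
    exprs = []
    for position, (name, sql_type) in enumerate(columns, start=1):
        part = f"split_part({line_col}, '{quoted}', {position})"
        # Assumption: empty fields should become NULL (this also applies
        # to empty strings in varchar columns).
        exprs.append(f"    cast(nullif({part}, '') as {sql_type}) as {name}")
    names = ", ".join(name for name, _ in columns)
    select_list = ",\n".join(exprs)
    return f"insert into {target} ({names})\nselect\n{select_list}\nfrom {staging};"


columns = [
    ("col1", "int"),
    ("col2", "int"),
    ("col3", "varchar(256)"),
    ("col4", "int"),
    ("col5", "int"),
]
# data.texttest_staging is a hypothetical one-column staging table.
sql = build_split_insert("data.texttest", "data.texttest_staging", columns, "\u00fe")
print(sql)
```

Note that the COPY into the staging table would still need some ASCII delimiter that never occurs in the data, so that each whole line lands in the one column, and the header row would still have to be skipped with ignoreheader 1.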
