Loading JSON data to AWS Redshift results in NULL values

Question

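I am trying to run a COPY to load data from JSON files in an S3 bucket directly into Redshift. The COPY succeeds, and afterwards the table has the correct number of rows, but every field in every record is NULL!

The COPY command takes about as long as expected and returns OK, and the Redshift console reports success with no errors. But when I query the table, it returns only NULL values.

The JSON is very simple and flat, and correctly formatted (based on the examples I found here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html).

Essentially one object per line, formatted like this: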

{ "col1": "val1", "col2": "val2", ... }
{ "col1": "val1", "col2": "val2", ... }
{ "col1": "val1", "col2": "val2", ... }

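I have tried a few things, such as rewriting the schema based on the values and data types in the JSON objects, and copying from uncompressed files. I thought maybe the JSON was not being parsed correctly on load, but I would expect an error to be raised if the objects could not be parsed.

My COPY command looks like this: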

copy events from 's3://mybucket/json/prefix' 
with credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
json 'auto' gzip;

Any guidance would be greatly appreciated. Thanks!

Answer 1

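So I found the cause. It would not have been obvious from the description in my original post.

When you create a table in Redshift, the column names are converted to lowercase. When you run a COPY with json 'auto', the JSON keys are matched against the column names case-sensitively.

The input data I was trying to load used camelCase for its field names, so when I ran the COPY, none of the fields matched the columns in the defined schema (which by then had all-lowercase column names).

The operation does not raise an error, though. It just leaves NULL in every column that has no match (in this case, all of them).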
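One way to avoid the mismatch is to lowercase the keys before uploading the file to S3. The following is a minimal Python sketch of that idea, not something from the original post: the helper names `lowercase_keys` and `lowercase_ndjson` and the sample field names are made up for illustration, and it assumes the input is newline-delimited JSON as in the question.

```python
import json


def lowercase_keys(value):
    """Recursively lowercase object keys; values are left untouched."""
    if isinstance(value, dict):
        return {key.lower(): lowercase_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [lowercase_keys(item) for item in value]
    return value


def lowercase_ndjson(text):
    """Rewrite newline-delimited JSON so that every key is lowercase."""
    out = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            out.append(json.dumps(lowercase_keys(json.loads(line))))
    return "\n".join(out)


# Hypothetical camelCase input, one object per line.
sample = '{"eventId": 1, "userName": "Ann"}\n{"eventId": 2, "userName": "Bob"}'
print(lowercase_ndjson(sample))
```

Only the keys change, so string values such as "Ann" keep their original case.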

Hope this helps someone avoid the same confusion!

Answer 2

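For cases where the JSON data objects do not correspond directly to column names, you can use a JSONPaths file to map the JSON elements to columns, as TimZ mentioned and as described here.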

Answer 3

COPY maps the data elements in the JSON source data to the columns in the target table by matching object keys, or names, in the source name/value pairs to the names of columns in the target table. The matching is case-sensitive. Column names in Amazon Redshift tables are always lowercase, so when you use the 'auto' option, matching JSON field names must also be lowercase. If the JSON field name keys aren't all lowercase, you can use a JSONPaths file to explicitly map column names to JSON field name keys.

The solution is to use a JSONPaths file.

Sample JSON:

{
    "Name": "Major",
    "Age": 19,
    "Add": {
        "street": {
            "st": "5 maint st",
            "ci": "Dub"
        },
        "city": "Dublin"
    },
    "Category_Name": ["MLB", "GBM"]
}

Sample table:

create table customer (
    name varchar,
    age int,
    address varchar,
    catname varchar
);

Sample JSONPaths file:

{
    "jsonpaths": [
        "$['Name']",
        "$['Age']",
        "$['Add']",
        "$['Category_Name']"
    ]
}

Sample COPY command:

copy customer --redshift code
from 's3://mybucket/customer.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json from 's3://mybucket/jpath.json' ; -- Jsonpath file to map fields

Example taken from here.
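Because the entries in a JSONPaths file are matched to the table columns by position, the file can be generated from an ordered list of JSON keys. The sketch below is only an illustration of that idea: `build_jsonpaths` is a hypothetical helper, and it assumes top-level keys that contain no single quotes.

```python
import json


def build_jsonpaths(keys):
    """Return JSONPaths file content for top-level JSON keys, in column order."""
    return json.dumps({"jsonpaths": [f"$['{key}']" for key in keys]}, indent=2)


# JSON keys listed in the same order as the table columns
# (name, age, address, catname).
print(build_jsonpaths(["Name", "Age", "Add", "Category_Name"]))
```

The printed text can be saved and uploaded to S3 as the file that the COPY command references after `json`.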

Answer 4

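This is probably because the column names in the Redshift table are lowercase while the field names in the JSON file are uppercase (or camelCase). As a workaround, use the 'auto ignorecase' option instead of 'auto', and Redshift will match the fields to the corresponding columns regardless of case. https://docs.aws.amazon.com/en_us/redshift/latest/dg/copy-parameters-data-format.html#copy-json

This is mentioned in the COPY parameters section of the documentation.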
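The difference between the two options can be pictured with a toy model in Python. This is only a simulation of the matching rule described above, not how Redshift implements it; `simulate_auto` and the sample column names are invented for the illustration.

```python
def simulate_auto(record, columns, ignorecase=False):
    """Map JSON keys to columns; columns without a matching key become None (NULL)."""
    if ignorecase:
        # Toy analogue of 'auto ignorecase': compare keys in lowercase.
        record = {key.lower(): value for key, value in record.items()}
    return {column: record.get(column) for column in columns}


columns = ["eventid", "username"]  # assumed lowercase, as Redshift stores them
record = {"eventId": 1, "userName": "Ann"}

print(simulate_auto(record, columns))                   # models json 'auto'
print(simulate_auto(record, columns, ignorecase=True))  # models json 'auto ignorecase'
```

In the first call no key matches, so every column comes back as None, which mirrors the all-NULL rows from the question. In the second call both columns are filled.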

Answer 5

There is now an option to ignore case when loading JSON data from S3 into Redshift:

COPY crypt.public.coindetails FROM 's3://cryptstreaxxxx/filetest2.json'
IAM_ROLE 'arn:aws:iam::xxxxxxx:role/service-role/AmazonRedshift-CommandsAccessRole-20211201T210748' 
FORMAT AS JSON 'auto ignorecase' REGION AS 'us-east-1';

In the Redshift UI, click File Options and select the corresponding option there.

Answer 6

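Sharing my experience in case it is useful to others.

In my case I was loading the data with an INSERT statement, and I also had camelCase fields. When I tried to query fields of the JSON column, the result was NULL.

So I had to load the specific field as follows: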

INSERT INTO my_schema.my_table (
    SELECT json_parse(lower(my_json)),
    [...]
    FROM [...]
);

After lowercasing the source JSON, I was able to query the JSON fields correctly.

https://docs.aws.amazon.com/redshift/latest/dg/JSON_PARSE.html https://docs.aws.amazon.com/redshift/latest/dg/r_LOWER.html
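One thing to keep in mind with this approach: applying lower() to the whole JSON text lowercases the string values as well as the keys. The Python sketch below shows the same effect outside Redshift, using `str.lower()` as a stand-in for the SQL function; the sample record is made up for illustration.

```python
import json

raw = '{"userName": "Ann Lee", "Country": "IE"}'

# Analogue of lower(my_json): keys AND string values are lowercased.
whole_text = json.loads(raw.lower())

# Alternative: lowercase only the top-level keys and keep the values as they are.
keys_only = {key.lower(): value for key, value in json.loads(raw).items()}

print(whole_text)
print(keys_only)
```

If the case of the values matters, lowercasing only the keys before loading is the safer choice.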

amazon-web-services

amazon-redshift