AWS Glue Crawler - DynamoDB Export - Get attribute names in schema instead of struct

Question

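I have defined a default crawler on a data directory exported from DynamoDB. I am trying to get it to give me a structured table rather than a table with a single column of type struct. What do I have to do to get the actual column names in there? I have tried adding custom classifiers and different path expressions, but nothing seems to have any effect, and I feel like I am missing something really obvious.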

I am using the crawler builder in Glue, which does not seem to offer much customization.

This is the schema of the table generated by the default crawler:

This is one of the items I exported from DynamoDB:

{
    "Item": {
        "the_url": {
            "S": "/2021/07/06/****redacted****.html"
        },
        "as_of_when": {
            "S": "2021-09-01"
        },
        "user_hashes": {
            "SS": [
                "****redacted*****"
            ]
        },
        "user_id_hashes": {
            "SS": [
                "u3MeXDcpQm0ACYuUv6TMrg=="
            ]
        },
        "accumulated_count": {
            "N": "1"
        },
        "today_count": {
            "N": "1"
        }
    }
}

Answer 1

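The way Athena interprets JSON data means that your data has only one column, Item. Athena has no mechanism for mapping arbitrary parts of a JSON object to columns; it can only map top-level properties to columns.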
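To make the point concrete, here is a minimal Python sketch (not Athena itself, just an illustration) of what a reader that only maps top-level properties to columns sees in an exported record. The shortened record and the variable names are made up for this example.

```python
import json

# A shortened, hypothetical record in the same shape as the exported item.
record = json.loads(
    '{"Item": {"the_url": {"S": "/a.html"}, "today_count": {"N": "1"}}}'
)

# Only the top-level keys are candidates for columns.
top_level_columns = list(record.keys())

# The attribute names sit one level down, inside the single "Item" struct.
nested_attributes = list(record["Item"].keys())

print(top_level_columns)
print(nested_attributes)
```

The attribute names exist only inside the `Item` value, which is why the crawler reports a single struct column.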

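If you want other parts of the objects as columns, you will have to either create a new table with transformed data, or create a view that exposes the attributes as columns, for example: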

CREATE OR REPLACE VIEW attributes_as_top_level_columns AS
SELECT
  item.the_url.S AS the_url,
  CAST(item.as_of_when.S AS DATE) AS as_of_when,
  item.user_hashes.SS AS user_hashes,
  item.user_id_hashes.SS AS user_id_hashes,
  item.accumulated_count.N AS accumulated_count,
  item.today_count.N AS today_count
FROM items

In the example above I have also flattened the data type keys (S, SS, N) and cast the date string to a date.
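For the other option, a new table with transformed data, the transformation could look roughly like the following Python sketch. It assumes the export is in DynamoDB JSON format with one `{"Item": ...}` object per line; the function names `unwrap` and `flatten_record` and the sample line are hypothetical, and binary (`B`, `BS`) values are simply left as the strings found in the export.

```python
import json


def to_number(text):
    """Parse a DynamoDB "N" string as int when possible, else float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def unwrap(typed_value):
    """Turn one typed value such as {"S": "x"} into a plain Python value."""
    (type_tag, value), = typed_value.items()
    if type_tag == "N":
        return to_number(value)
    if type_tag == "NS":
        return [to_number(v) for v in value]
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unwrap(v) for v in value]
    if type_tag == "M":
        return {k: unwrap(v) for k, v in value.items()}
    # S, SS, BOOL, B and BS are already usable as plain JSON values.
    return value


def flatten_record(line):
    """Map one exported line to a dict with attributes at the top level."""
    record = json.loads(line)
    return {name: unwrap(typed) for name, typed in record["Item"].items()}


# Hypothetical sample line modeled on the item in the question.
exported_line = json.dumps({
    "Item": {
        "the_url": {"S": "/2021/07/06/example.html"},
        "as_of_when": {"S": "2021-09-01"},
        "user_id_hashes": {"SS": ["u3MeXDcpQm0ACYuUv6TMrg=="]},
        "accumulated_count": {"N": "1"},
        "today_count": {"N": "1"},
    }
})

flat = flatten_record(exported_line)
print(json.dumps(flat))
```

Writing one such flattened object per line to a new location and crawling that location instead should give a table whose columns are the attribute names, since they are now top-level properties.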

json

amazon-dynamodb

amazon-athena

aws-glue