AWS Glue Crawler defines one schema per file

I have the following data:

{
  "0": "x",
  "1": [
    [
      "x",
      {
        "app_instance_id": "x",
        "app_instance_time": "x",
        "page": {
          "url": "x"
        },
        "user_agent": "x",
        "timestamp": "x",
        "session_id": "x",
        "permanent_id": "x",
        "event_category": "x",
        "customer": "x",
        "referrer": {
          "url": "x"
        },
        "ip_address": "x"
      }
    ],
    [
      "x",
      {
        "app_instance_id": "x",
        "app_instance_time": "x",
        "page": {
          "url": "x"
        },
        "user_agent": "x",
        "timestamp": "x",
        "session_id": "x",
        "permanent_id": "x",
        "event_category": "x",
        "customer": "x",
        "referrer": {
          "url": "x"
        },
        "ip_address": "x"
      }
    ]
  ],
  "time": 1627978464738
}{
  "event": "x",
  "userId": "x",
  "badgeId": null,
  "levelId": null,
  "projectId": "x",
  "ua": "x",
  "key": "x",
  "requestMethod": "x",
  "endpoint": "x",
  "customerId": "x",
  "durationMs": 0,
  "responseCode": 200,
  "time": 1627978465804
}{
  "event": "x",
  "userId": "x",
  "badgeId": null,
  "levelId": null,
  "projectId": "x",
  "ua": "x",
  "key": "x",
  "requestMethod": "GET",
  "endpoint": "x",
  "customerId": "x",
  "durationMs": 0,
  "responseCode": 200,
  "time": 1627978465798
}{
  "event": null,
  "ua": "x",
  "browser.name": "Firefox",
  "browser.version": "87.0",
  "browser.major": "87",
  "engine.name": "Gecko",
  "engine.version": "87.0",
  "os.name": "Mac OS",
  "os.version": "10.15",
  "lineCount": 3,
  "data": 20,
  "carrier": "x",
  "spendingNow": 200,
  "client": "x",
  "time": 1619185462317
}{
  "event": null,
  "ua": "x",
  "browser.name": "Chrome",
  "browser.version": "90.0.4430.66",
  "browser.major": "90",
  "engine.name": "Blink",
  "engine.version": "90.0.4430.66",
  "os.name": "Android",
  "os.version": "10",
  "device.vendor": "Samsung",
  "device.model": "SM-G965F",
  "device.type": "mobile",
  "lineCount": 1,
  "data": 25,
  "carrier": "x",
  "spendingNow": 10,
  "client": "x",
  "time": 1619201845480
}

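As you can see, a single file contains JSON objects with several different schemas. However, when I use a Glue crawler to define tables for my data, it creates one table for the whole file, containing every column from every JSON object (0, 1, time, event, userId, badgeId, and so on), as shown in the screenshot below.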

What I want is to tell the crawler to create multiple tables, one per schema, just as it does for separate files. What can I do?

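I don't think you can. A schema is meant to describe the general structure of the files in a directory. With multiple schemas in a single file you could not even browse that file's data sensibly, so it makes little sense.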

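If you really want to detect the different schemas, the best approach is to clean up the data, or to use separate files with a consistent schema (in separate paths).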
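
As a rough illustration of the "separate files" suggestion, here is a minimal Python sketch that splits a string of back-to-back JSON objects into groups, one per schema. The group names and the marker keys in SCHEMA_MARKERS are assumptions made for this example (they are guessed from the sample data, not something Glue defines), so adjust them to your real records.

```python
import json
from collections import defaultdict

# Hypothetical marker keys: the first marker found in an object decides its group.
# These names are illustrative only and are inferred from the sample data.
SCHEMA_MARKERS = {
    "0": "tracking",
    "userId": "api_calls",
    "browser.name": "browser_stats",
}


def iter_json_objects(text):
    """Yield each object from a string of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def schema_name(obj):
    """Return the group name for an object, or 'unknown' if no marker matches."""
    for key, name in SCHEMA_MARKERS.items():
        if key in obj:
            return name
    return "unknown"


def group_by_schema(text):
    """Group all objects in the text by their schema name."""
    groups = defaultdict(list)
    for obj in iter_json_objects(text):
        groups[schema_name(obj)].append(obj)
    return dict(groups)


def to_json_lines(objs):
    """Serialize objects as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(o) for o in objs) + "\n"
```

Each group's to_json_lines output could then be written to its own path (for example one prefix per group name in your bucket) and the crawler pointed at those paths, so that every path holds files with a single consistent schema.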