AWS Glue Crawler defines one schema per file
I have the following data:
{
"0": "x",
"1": [
[
"x",
{
"app_instance_id": "x",
"app_instance_time": "x",
"page": {
"url": "x"
},
"user_agent": "x",
"timestamp": "x",
"session_id": "x",
"permanent_id": "x",
"event_category": "x",
"customer": "x",
"referrer": {
"url": "x"
},
"ip_address": "x"
}
],
[
"x",
{
"app_instance_id": "x",
"app_instance_time": "x",
"page": {
"url": "x"
},
"user_agent": "x",
"timestamp": "x",
"session_id": "x",
"permanent_id": "x",
"event_category": "x",
"customer": "x",
"referrer": {
"url": "x"
},
"ip_address": "x"
}
]
],
"time": 1627978464738
}{
"event": "x",
"userId": "x",
"badgeId": null,
"levelId": null,
"projectId": "x",
"ua": "x",
"key": "x",
"requestMethod": "x",
"endpoint": "x",
"customerId": "x",
"durationMs": 0,
"responseCode": 200,
"time": 1627978465804
}{
"event": "x",
"userId": "x",
"badgeId": null,
"levelId": null,
"projectId": "x",
"ua": "x",
"key": "x",
"requestMethod": "GET",
"endpoint": "x",
"customerId": "x",
"durationMs": 0,
"responseCode": 200,
"time": 1627978465798
}{
"event": null,
"ua": "x",
"browser.name": "Firefox",
"browser.version": "87.0",
"browser.major": "87",
"engine.name": "Gecko",
"engine.version": "87.0",
"os.name": "Mac OS",
"os.version": "10.15",
"lineCount": 3,
"data": 20,
"carrier": "x",
"spendingNow": 200,
"client": "x",
"time": 1619185462317
}{
"event": null,
"ua": "x",
"browser.name": "Chrome",
"browser.version": "90.0.4430.66",
"browser.major": "90",
"engine.name": "Blink",
"engine.version": "90.0.4430.66",
"os.name": "Android",
"os.version": "10",
"device.vendor": "Samsung",
"device.model": "SM-G965F",
"device.type": "mobile",
"lineCount": 1,
"data": 25,
"carrier": "x",
"spendingNow": 10,
"client": "x",
"time": 1619201845480
}
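As you can see, a single file contains JSON objects with several different schemas. However, when I use a Glue crawler to define tables for my data, it creates one table for the whole file, containing every column from every JSON object (0, 1, time, event, userId, badgeId, and so on), as shown in the screenshot below.
What I want is to tell the crawler to create multiple tables, one per schema, just as it does for separate files. What can I do?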
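I don't think you can do that.
A schema is supposed to describe the structure of a file or, more usually, of a directory of files. A single file with multiple schemas could not even be browsed as one table, and it doesn't really make sense.
If you really want to detect the different schemas, the best option is to clean the data, or to keep separate files with consistent schemas (in separate paths).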
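One way to get "separate files with consistent schemas" is a small preprocessing step that runs before the crawler. The following is only a minimal sketch, not something Glue provides: the function names, the `schema_<n>/data.json` layout, and the idea of treating the sorted set of top-level keys as the schema are all assumptions made for illustration. It also assumes every top-level value in the file is a JSON object, and it leaves uploading the resulting folders to separate S3 prefixes to you.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_concatenated_json(text):
    """Yield each top-level JSON value from back-to-back objects like {...}{...}."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def group_by_schema(objects):
    """Group objects by their sorted top-level key names (an assumed notion of 'schema')."""
    groups = defaultdict(list)
    for obj in objects:
        signature = tuple(sorted(obj.keys()))
        groups[signature].append(obj)
    return groups


def write_groups(groups, out_dir):
    """Write each group as JSON Lines under out_dir/schema_<n>/data.json (hypothetical layout)."""
    out_dir = Path(out_dir)
    paths = []
    for i, (_signature, objs) in enumerate(sorted(groups.items())):
        target = out_dir / f"schema_{i}" / "data.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("w", encoding="utf-8") as fh:
            for obj in objs:
                fh.write(json.dumps(obj) + "\n")
        paths.append(target)
    return paths
```

Grouping by the exact key set is deliberately strict. In the sample data the two browser events would land in different groups, because only one of them carries the `device.*` fields. If those should end up in the same table, the signature would need to be looser, for example a hand-picked subset of keys that identifies each event type.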