AWS S3 Select - Retrieve data from 2 different levels of a JSON

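I have this JSON stored in an S3 file (it is actually the output of an AWS Comprehend EntitiesDetection job, which unfortunately means I have no control over how the JSON is organized: the AWS job itself uploads it to S3, so I cannot modify the structure of this JSON input):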

{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 11, "Score": 0.9815415143966675, "Text": "5 start-up", "Type": "QUANTITY"}, {"BeginOffset": 61, "EndOffset": 183, "Score": 0.8883988261222839, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 212, "EndOffset": 327, "Score": 0.8162660002708435, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 337, "EndOffset": 339, "Score": 0.7018660306930542, "Text": "Trump", "Type": "PERSON"}, {"BeginOffset": 364, "EndOffset": 484, "Score": 0.8932908177375793, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 513, "EndOffset": 626, "Score": 0.8157837986946106, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 636, "EndOffset": 638, "Score": 0.6977631449699402, "Text": "Oprah Winfrey", "Type": "PERSON"}, {"BeginOffset": 963, "EndOffset": 971, "Score": 0.4658013880252838, "Text": "facebook", "Type": "ORGANIZATION"}, {"BeginOffset": 972, "EndOffset": 979, "Score": 0.6886632442474365, "Text": "twitter", "Type": "TITLE"}, {"BeginOffset": 985, "EndOffset": 993, "Score": 0.7970104813575745, "Text": "linkedin", "Type": "ORGANIZATION"}, {"BeginOffset": 994, "EndOffset": 998, "Score": 0.36566048860549927, "Text": "Menu", "Type": "TITLE"}
  ],
  "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"},


{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 13, "Score": 0.9995881915092468, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 27, "EndOffset": 69, "Score": 0.8302029371261597, "Text": "Constitution \u00e9conomique\" - African Manager", "Type": "TITLE"}, {"BeginOffset": 94, "EndOffset": 126, "Score": 0.48702114820480347, "Text": ".wpb_animate_when_almost_visible", "Type": "OTHER"}, {"BeginOffset": 290, "EndOffset": 298, "Score": 0.47538018226623535, "Text": "Fran\u00e7ais", "Type": "OTHER"}, {"BeginOffset": 299, "EndOffset": 306, "Score": 0.6746407747268677, "Text": "English", "Type": "OTHER"}, {"BeginOffset": 464, "EndOffset": 476, "Score": 0.9992197155952454, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 515, "EndOffset": 527, "Score": 0.9994662404060364, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 581, "EndOffset": 596, "Score": 0.6652442812919617, "Text": "African Manager", "Type": "ORGANIZATION"}, {"BeginOffset": 599, "EndOffset": 615, "Score": 0.8012278079986572, "Text": "09/09/2019 08:45", "Type": "DATE"}, {"BeginOffset": 674, "EndOffset": 685, "Score": 0.8724801540374756, "Text": "tunisiennes", "Type": "OTHER"}, {"BeginOffset": 689, "EndOffset": 701, "Score": 0.9975908398628235, "Text": "15 septembre", "Type": "DATE"}, {"BeginOffset": 753, "EndOffset": 781, "Score": 0.9481445550918579, "Text": "certain nombre d\u2019initiatives", "Type": "QUANTITY"}
  ],
  "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"}

//and so on ...

I need to find all entities with Type = PERSON and a Score > 0.7, and retrieve the following data for each: the person and the file.

My query expression today is:

select s.Text from s3object[*].Entities[*] s where s.Type= 'PERSON' AND s.Score > 0.7;

This outputs:

[

    {
        "Text": "Trump"
    },
    {
        "Text": "Oprah Winfrey"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },

]

This is fine up to a point, but I need to associate each "Text" (the person's name) with the file it came from. So the query output I expect is:

[

    {
        "Text": "Trump",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Oprah Winfrey",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },

]

How can I retrieve this? I tried many possibilities using https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html, but none of them worked.

There is a note on the page you shared:

Note Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins.

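I would set up Athena for more complex queries against S3 directly (example from official doc). Another option is to restructure the JSON in a way that avoids the join, for example by duplicating "File" at the "Text" level. Of course, you could also index this JSON in many other tools and formats to make the data searchable/"queryable".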
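
As a rough illustration of the restructuring option, here is a minimal Python sketch that copies "File" down to each matching entity. It assumes the Comprehend output has already been downloaded and that each document is a single JSON object on its own line (the sample in the question is pretty-printed, so this may need adapting). `flatten_entities` and the sample records are hypothetical names and data made up for the example, not part of any AWS API.

```python
import json


def flatten_entities(lines, entity_type="PERSON", min_score=0.7):
    """Yield one {"Text", "File"} row per entity that passes the filter.

    Assumption: every non-empty line is one complete JSON document with an
    "Entities" list and a "File" string, as in the question's sample.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between documents
        record = json.loads(line)
        for entity in record.get("Entities", []):
            if (entity.get("Type") == entity_type
                    and entity.get("Score", 0.0) > min_score):
                # Duplicate the parent-level "File" next to the entity "Text"
                yield {"Text": entity["Text"], "File": record.get("File")}


# Made-up sample data in the same shape as the question's input
sample = [
    '{"Entities": [{"Score": 0.99, "Text": "Nabil Karoui", "Type": "PERSON"},'
    ' {"Score": 0.80, "Text": "09/09/2019 08:45", "Type": "DATE"}],'
    ' "File": "inputs/a"}',
    '{"Entities": [{"Score": 0.50, "Text": "Someone", "Type": "PERSON"}],'
    ' "File": "inputs/b"}',
]

rows = list(flatten_entities(sample))
print(json.dumps(rows, indent=4))
```

The flattened rows could be written back to S3 as one JSON object per line, at which point a plain S3 Select query on `Text` and `File` would no longer need a join.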