Athena/Glue - Parsing simple JSON (but treats it like a CSV)

Question

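Based on my previous post, I built a simple JSON file with one "row" per line. I'm still shocked that this isn't valid JSON, since it has no square brackets around it.

A data file: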

{"firstName": "Neal",    "lastName": "Walters", "city": "Irving", "state", "TX"  }
{"firstName": "Fred",    "lastName": "Flintstone",   "city": "Bedrock",  "state", "TX"}
{"firstName": "Barney",  "lastName": "Rubble",   "city": "Stillwater",   "state", "OK"}

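After running it through Glue, this was my first query, and the result was very disappointing.

Below is the schema it generated. From it we can see that Glue apparently thinks this is a CSV rather than JSON. When setting up the Glue crawler, I didn't see any option that asks for the file type. Did I miss a hidden option somewhere?

For a simple example like this, I could probably fix the schema by hand. But is Glue really that bad a parser? In my real application I have about 150 fields, so ideally it would generate all the columns for me.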

CREATE EXTERNAL TABLE `flattb_testflatjson`(
  `col0` string, 
  `col1` string, 
  `col2` string, 
  `col3` string, 
  `col4` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://relatix/polygonData/history/testflatjson/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='FlatJsonTestForAthena', 
  'areColumnsQuoted'='false', 
  'averageRecordSize'='83', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'objectCount'='1', 
  'recordCount'='3', 
  'sizeKey'='255', 
  'typeOfData'='file')

Answer 1

Glue is generally pretty bad, but this actually surprised me until I saw Achyut's comment: your JSON is malformed.

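JSON is a data format, not a file format. There is no such thing as a well-formed JSON file, because the specification doesn't cover that. Tools like Spark, Hadoop, and Athena require JSON data in files to have one document per line, because that makes the data easy and efficient to process. This is sometimes called "JSON streaming" (not a good name, since we're talking about files) or "line-delimited JSON".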
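To make the one-document-per-line idea concrete, here is a minimal Python sketch that parses each line independently using only the standard `json` module. The helper name `check_json_lines` is made up for illustration. The first sample row keeps the comma after "state" as written in the question; the second is a corrected row.

```python
import json

def check_json_lines(text):
    """Parse each non-empty line as its own JSON document.

    Returns (records, errors), where errors is a list of
    (line_number, message) pairs for lines that failed to parse.
    """
    records, errors = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
    return records, errors

# Line 1 has the comma-instead-of-colon mistake; line 2 is corrected.
sample = (
    '{"firstName": "Neal", "lastName": "Walters", "city": "Irving", "state", "TX"}\n'
    '{"firstName": "Fred", "lastName": "Flintstone", "city": "Bedrock", "state": "TX"}\n'
)
records, errors = check_json_lines(sample)
print(len(records), "valid;", "bad lines:", [n for n, _ in errors])
```

Running a check like this over a file before crawling it should flag rows such as the ones in the question.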

I think you would be better off creating the table by hand. You can find an example to start from in the documentation: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html
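Since typing out roughly 150 columns by hand is tedious, a sketch like the following could generate the DDL from the keys of one sample record. It assumes every column is a `string` and uses the OpenX JSON SerDe class name from the linked Athena documentation. `build_json_table_ddl` is a hypothetical helper, and the table name and location are copied from the question.

```python
import json

def build_json_table_ddl(table, location, sample_line):
    """Build a CREATE EXTERNAL TABLE statement from one JSON record.

    Assumption: every field is declared as string; adjust types by hand.
    """
    record = json.loads(sample_line)
    # Keys come back in the order they appear in the sample line.
    columns = ",\n".join(f"  `{name}` string" for name in record)
    return (
        f"CREATE EXTERNAL TABLE `{table}`(\n"
        f"{columns})\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )

# A corrected version of the first row from the question.
sample_line = '{"firstName": "Neal", "lastName": "Walters", "city": "Irving", "state": "TX"}'
ddl = build_json_table_ddl(
    "flattb_testflatjson",
    "s3://relatix/polygonData/history/testflatjson/",
    sample_line,
)
print(ddl)
```

The generated statement is only a starting point: nested fields and non-string types would still need to be filled in manually.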

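You should also use a proper JSON serialization library to write your JSON, so that you don't end up with syntax errors such as an accidental comma instead of a colon.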
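As an illustration of letting a library do the serialization, the sketch below writes the three rows from the question with Python's standard `json` module, one document per line. The function name `to_json_lines` is an assumption made for this example.

```python
import json

# The rows from the question, as plain dictionaries.
rows = [
    {"firstName": "Neal", "lastName": "Walters", "city": "Irving", "state": "TX"},
    {"firstName": "Fred", "lastName": "Flintstone", "city": "Bedrock", "state": "TX"},
    {"firstName": "Barney", "lastName": "Rubble", "city": "Stillwater", "state": "OK"},
]

def to_json_lines(records):
    """Serialize each record as one JSON document per line."""
    # json.dumps without indent emits a single line per record.
    return "".join(json.dumps(record) + "\n" for record in records)

output = to_json_lines(rows)
print(output, end="")
```

Because the serializer emits the separators itself, a `"state", "TX"` style mistake cannot occur.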

Answer 2

You may want to update the Glue table properties, specifically

'classification'='csv',

to

'classification'='json',
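If the property is changed programmatically rather than in the console, the edit itself is small. The sketch below works on a plain dictionary whose shape is assumed to mirror the `TableInput` structure of the Glue API (table properties under `Parameters`). It makes no AWS calls, and `with_json_classification` is a hypothetical helper.

```python
import copy

def with_json_classification(table_input):
    """Return a copy of a table definition with classification set to json."""
    updated = copy.deepcopy(table_input)  # leave the caller's dict untouched
    updated.setdefault("Parameters", {})["classification"] = "json"
    return updated

# Trimmed-down stand-in for the table definition shown in the question.
table_input = {
    "Name": "flattb_testflatjson",
    "Parameters": {"classification": "csv", "delimiter": ","},
}
updated = with_json_classification(table_input)
print(updated["Parameters"]["classification"])
```

With boto3, the result would presumably be passed as `TableInput` to the Glue client's `update_table` call along with the database name. Note that this sketch leaves the `col0` to `col4` columns and the delimited row format from the generated DDL untouched, so this change alone may not be enough.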
