JSON / AWS GLUE + AWS Athena / Hive 中的结构列类型?

JSON / struct column type in AWS GLUE + AWS Athena / Hive?

我有一个从嵌套 JSON 创建的 CSV 文件。它既有常规类型的列(例如 int、string),也有 JSON 列,由嵌套的 JSONs:

创建
attributes;business_id;categories;city;days_open;latitude;longitude;name;review_count;stars;state
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "beer_and_wine", "Ambience": {"casual": True, "classy": False, "divey": False, "hipster": False, "intimate": False, "romantic": False, "touristy": False, "trendy": False, "upscale": False}, "BYOB": False, "BikeParking": True, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": False, "lot": False, "street": True, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": True, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": False, "GoodForMeal": {"breakfast": False, "brunch": False, "dessert": False, "dinner": False, "latenight": False, "lunch": False}, "HappyHour": True, "HasTV": True, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": True, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": True, "RestaurantsPriceRange": 2, "RestaurantsReservations": False, "RestaurantsTableService": True, "RestaurantsTakeOut": True, "Smoking": "no", "WheelchairAccessible": True, "WiFi": "free"};6iYb2HFDywm3zjuRg0shjw;["Gastropubs", "Food", "Beer Gardens", "Restaurants", "Bars", "American (Traditional)", "Beer Bar", "Nightlife", "Breweries"];Boulder;["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];40.0175444;-105.2833481;Oskar Blues Taproom;86;4.0;CO
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "beer_and_wine", "Ambience": {"casual": True, "classy": False, "divey": False, "hipster": False, "intimate": False, "romantic": False, "touristy": False, "trendy": False, "upscale": False}, "BYOB": False, "BikeParking": False, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": True, "lot": False, "street": False, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": True, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": True, "GoodForMeal": {"breakfast": True, "brunch": False, "dessert": False, "dinner": False, "latenight": False, "lunch": True}, "HappyHour": False, "HasTV": False, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": False, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": False, "RestaurantsPriceRange": 2, "RestaurantsReservations": False, "RestaurantsTableService": True, "RestaurantsTakeOut": True, "Smoking": "no", "WheelchairAccessible": False, "WiFi": "free"};tCbdrRPZA0oiIYSmHG3J0w;["Salad", "Soup", "Sandwiches", "Delis", "Restaurants", "Cafes", "Vegetarian"];Portland;["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"];45.5889058992;-122.5933307507;Flying Elephants at PDX;126;4.0;OR
{"AcceptsInsurance": False, "AgesAllowed": "allages", "Alcohol": "none", "Ambience": None, "BYOB": False, "BikeParking": False, "BusinessAcceptsBitcoin": False, "BusinessAcceptsCreditCards": True, "BusinessParking": {"garage": False, "lot": False, "street": True, "valet": False, "validated": False}, "ByAppointmentOnly": False, "Caters": False, "CoatCheck": False, "Corkage": False, "DogsAllowed": False, "DriveThru": False, "GoodForDancing": False, "GoodForKids": False, "GoodForMeal": None, "HappyHour": False, "HasTV": False, "Music": None, "NoiseLevel": "average", "Open24Hours": False, "OutdoorSeating": False, "RestaurantsAttire": "casual", "RestaurantsCounterService": False, "RestaurantsDelivery": False, "RestaurantsGoodForGroups": False, "RestaurantsPriceRange": 2, "RestaurantsReservations": True, "RestaurantsTableService": True, "RestaurantsTakeOut": False, "Smoking": "no", "WheelchairAccessible": False, "WiFi": "no"};bvN78flM8NLprQ1a1y5dRg;["Antiques", "Fashion", "Used", "Vintage & Consignment", "Shopping", "Furniture Stores", "Home & Garden"];Portland;["Thursday", "Friday", "Saturday", "Sunday"];45.5119069956;-122.6136928797;The Reclaimory;13;4.5;OR

能否使用 AWS Glue 处理此文件以输入 AWS Athena / Hive(在 Athena 内部使用)?特别是,如何为 JSON 列指定数据类型?我必须手动执行此操作吗? JSON 写得好吗,还是应该重新格式化?

将尽力回答您的所有问题。
能否使用 AWS Glue 处理此文件以输入 AWS Athena / Hive(在 Athena 内部使用)?
应该。如果您正确构建配置单元 table,任何 csv 文件都可以上传到那里。

如何为 JSON 列指定数据类型?
在配置单元中,您可以存储为 string。查看您的 json 结构,您可以使用这样的表达式轻松访问元素 - get_json_object(json_col_str,'$.BusinessParking.garage').

我必须手动执行此操作吗?
我想是的,除非你有一些自动 DDL 创建实用程序。您可以将示例行放在 xl 中,并轻松找出 table 结构。

JSON写得好吗,还是应该重新格式化?
从你给出的例子来看,我检查了最后一行, json 对象对我来说似乎没问题。我还使用 https://jsonformatter.curiousconcept.com/ 进行了检查,它以一种漂亮的格式对其进行验证和格式化。如有不符,可以使用。

只要 JSON 列中没有分号,您就应该能够使用 Athena 查询此数据。将您的 table 定义为 CSV,以分号作为分隔符,并使用 string 作为 JSON 列的类型。

查询这个table时可以用the JSON functions查询JSON列,例如:

SELECT json_extract_scalar(attributes, '$.AcceptsInsurance')
…