Athena nested struct querying: how to get value_counts in SQL
I have a big nested struct in AWS Athena. Here is one column of the table, named "petowners":
{_id=5e6b531a412345e0e86aeae0, status=NotAnalyzed, animalcategories=[{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}], ...many-other-values}
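I'm looking for:
- An equivalent of the Python function value_counts for this column, meaning an Athena SQL command that outputs, for this row: [mammals:1, birds:2, UnknownField:4]
- A way to aggregate: a histogram of the total number of pets per owner (for this row, the total is 7)
- The number of pet owners who have UnknownField in 'animalcategories'
- The number of animal types in the entire table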
Here is the start of a solution.
Let's call the table "entire_table":
-- Flatten animalcategories: one output row per array element (ac)
SELECT t.petowners._id,
       t.petowners.animalcategories,
       ac.categoryname,
       ac.matches
FROM entire_table t, UNNEST(t.petowners.animalcategories) AS u(ac)
This query outputs a table with columns named "categoryname" and "matches", in which each owner's row is repeated once per category name under that _id:
| _id | animalcategories | categoryname | matches |
|--------------------------|---------------------------------------------------------------------------------------------------------------|--------------|---------|
| 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | mammals | 1 |
| 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | birds | 2 |
| 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | UnknownField | 4 |
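For what it's worth, here is a rough sketch of how the four items above might be approached. I haven't run it against the real data. It assumes the table is entire_table with the petowners column shown above and that matches is numeric. The aliases owner_id, value_counts, total_pets, owners, owners_with_unknown and animal_types are made up for illustration. It uses the array functions transform, reduce, filter, cardinality and map from the Presto documentation, plus UNNEST over a plain array of strings. The trim calls are there only because the sample shows a space before UnknownField.

```sql
-- Sketch only: assumes entire_table(petowners) as described in the post.
-- Athena runs one statement at a time, so run each query separately.

-- 1. value_counts per row: a map of categoryname -> matches.
--    map(keys, values) raises an error if a category repeats within one row.
SELECT t.petowners._id AS owner_id,
       map(transform(t.petowners.animalcategories, x -> x.categoryname),
           transform(t.petowners.animalcategories, x -> x.matches)) AS value_counts
FROM entire_table t;

-- 2. Histogram of total pets per owner: sum matches inside each array,
--    then count how many owners share each total.
WITH totals AS (
    SELECT t.petowners._id AS owner_id,
           reduce(t.petowners.animalcategories,
                  CAST(0 AS bigint),
                  (s, x) -> s + COALESCE(CAST(x.matches AS bigint), 0),
                  s -> s) AS total_pets
    FROM entire_table t
)
SELECT total_pets,
       count(*) AS owners
FROM totals
GROUP BY total_pets
ORDER BY total_pets;

-- 3. Number of owners with at least one UnknownField entry.
SELECT count(*) AS owners_with_unknown
FROM entire_table t
WHERE cardinality(filter(t.petowners.animalcategories,
                         x -> trim(x.categoryname) = 'UnknownField')) > 0;

-- 4. Number of distinct animal types across the whole table
--    (this counts UnknownField as a type; add a WHERE clause to exclude it).
SELECT count(DISTINCT trim(u.categoryname)) AS animal_types
FROM entire_table t
CROSS JOIN UNNEST(transform(t.petowners.animalcategories,
                            x -> x.categoryname)) AS u(categoryname);
```

The per-row work is done with lambdas rather than by unnesting the structs directly, because engine versions seem to differ on whether UNNEST over an array of structs returns a single struct column or one column per field. Unnesting an array of plain strings avoids that question.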
Here are the most relevant links (ordered by importance):
- A similar question on Stack Overflow
- Presto documentation showing Lambda Expressions and Functions which are another way to work with nested structs
- AWS explaining about "Querying Arrays with Complex Types and Nested Structures"
- A good blog read from Joe Celko about "Nesting levels in SQL"
- SQL original paper from 1970 IBM research by E.F.CODD added for the sake of "being pretty" and as a token of respect
- SQL pdf HUGE manual - a bit of an overkill but under "Query expressions" at page 323 I look for the answers I can't seem to find anywhere else
I also came across some less helpful links that I think are still worth mentioning; for the sake of a full review I'll add them here:
- AWS Athena forum - many good questions, yet sadly few answers
- Presto Google group - focused on the engineering part, not many answers either
I hope someone will find this post useful someday and get a shortcut past the hours of browsing the web for answers that I had to go through. Good luck.