Can a query be written in ArangoDB to aggregate values within joined documents?
Suppose you have a movie subscription service with normal and premium members.
Here is a sample of the data generated by user activity and stored as documents in a collection:
[
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 1
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 1,
"elapsed": 200
},
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 2
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 2,
"elapsed": 500
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 3
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 3,
"elapsed": 10
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 4
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 4,
"elapsed": 100
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 5
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 5,
"elapsed": 5
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 5,
"elapsed": 25
}
]
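As you can see, there are two kinds of "eventType":
- "sessionInfo" documents contain information common to the entire user session.
- "mediaPlay" documents store how many seconds of a movie were watched.
(Each "mediaPlay" event contains a sessionGroupID, so it can be associated with its session.)
Question #1:
Given that there are tens of millions of documents in total, how would you write a query that computes the total viewing time for each movie, grouped by user type?
Desired query result: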
premium users - total of "elapsed":
xmen: 500
starwars: 200
normal users - total of "elapsed":
xmen: 115
starwars: 25
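Question #2:
If the data structure is not well suited to this kind of query, what would the ideal structure be?
- For example, would it be better to embed the "mediaPlay" events as a nested array inside each "sessionInfo" document?
Like this?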
[
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 1,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 1,
"elapsed": 200
}
]
},
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 2,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 2,
"elapsed": 500
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 3,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 3,
"elapsed": 10
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 4,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 4,
"elapsed": 100
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 5,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 5,
"elapsed": 5
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 5,
"elapsed": 25
}
]
}
]
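Thanks for any guidance and suggestions!
The following query iterates over the collection and collects all session IDs, grouped by userType. It then runs a subquery that iterates over the collection again and, for every document whose eventType is "mediaPlay" and whose sessionGroupID is among the collected sessions, groups by movie and sums the elapsed time.
@@coll is a bind parameter that holds the name of your collection.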
FOR doc IN @@coll
  FILTER doc.eventType == "sessionInfo"
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN @@coll
        FILTER event.sessionGroupID IN sessions
        FILTER event.eventType == "mediaPlay"
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
    )
  }
The result of this query is:
[
{
"userTypes": "normal",
"movies": [
{
"movie": "starwars",
"elapsed": 25
},
{
"movie": "xmen",
"elapsed": 115
}
]
},
{
"userTypes": "premium",
"movies": [
{
"movie": "starwars",
"elapsed": 200
},
{
"movie": "xmen",
"elapsed": 500
}
]
}
]
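As a sanity check on the numbers above, the same two-step logic (collect session IDs per user type, then sum "elapsed" per movie for those sessions) can be sketched in plain Python over the sample data. This is only an in-memory illustration, not how ArangoDB executes the query; the function name totals_by_user_type is made up for this sketch.

```python
from collections import defaultdict

# Sample documents from the question (one mixed collection).
events = [
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 1},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 2},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 3},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 4},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 5},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def totals_by_user_type(docs):
    """Hypothetical helper mirroring the outer COLLECT plus the subquery."""
    # Step 1: like the outer COLLECT, gather session IDs per userType.
    sessions = defaultdict(set)
    for doc in docs:
        if doc["eventType"] == "sessionInfo":
            sessions[doc["userType"]].add(doc["sessionGroupID"])

    # Step 2: like the subquery, sum "elapsed" per movie for those sessions.
    result = {}
    for user_type, session_ids in sessions.items():
        movies = defaultdict(int)
        for doc in docs:
            if doc["eventType"] == "mediaPlay" and doc["sessionGroupID"] in session_ids:
                movies[doc["productSKU"]] += doc["elapsed"]
        result[user_type] = dict(movies)
    return result


print(totals_by_user_type(events))
```

Note that step 2 rescans all documents once per user type, just as the subquery in the AQL is evaluated once per group.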
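Regarding your second question: nesting arrays/objects will not optimize this query. Instead, you should split the data into two collections, one per eventType (for example, name the collections after the event types, sessionInfo and mediaPlay). This reduces the number of FILTER statements needed and, more importantly, lets you query the sessionInfo and mediaPlay documents separately, which greatly improves performance because each loop only has to read documents of the relevant type.
The query would then look like this: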
FOR doc IN sessionInfo
  COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
  RETURN {
    "userTypes" : userTypes,
    "movies" : (
      FOR event IN mediaPlay
        FILTER event.sessionGroupID IN sessions
        COLLECT movie = event.productSKU INTO elapsed = event.elapsed
        RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
    )
  }
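To illustrate why keeping the two event types apart helps, here is a minimal Python sketch of the same aggregation over two separate lists standing in for the sessionInfo and mediaPlay collections. It builds a sessionGroupID to userType lookup once and then makes a single pass over the play events. The names session_info, media_play and totals_single_pass are assumptions made for this sketch, and it says nothing about how ArangoDB plans the query internally.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two collections.
session_info = [
    {"userType": "premium", "sessionGroupID": 1},
    {"userType": "premium", "sessionGroupID": 2},
    {"userType": "normal", "sessionGroupID": 3},
    {"userType": "normal", "sessionGroupID": 4},
    {"userType": "normal", "sessionGroupID": 5},
]
media_play = [
    {"productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def totals_single_pass(sessions, plays):
    """Hypothetical helper: join plays to sessions via a lookup table."""
    # Build the sessionGroupID -> userType lookup once.
    user_type_of = {s["sessionGroupID"]: s["userType"] for s in sessions}

    # Single pass over the play events, summing "elapsed" per (userType, movie).
    totals = defaultdict(lambda: defaultdict(int))
    for play in plays:
        user_type = user_type_of.get(play["sessionGroupID"])
        if user_type is None:
            continue  # play event without a known session: skipped in this sketch
        totals[user_type][play["productSKU"]] += play["elapsed"]

    return {user_type: dict(movies) for user_type, movies in totals.items()}


print(totals_single_pass(session_info, media_play))
```

No eventType checks are needed here because each list only holds one kind of document, which is the same simplification the split collections bring to the AQL.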