Kusto query to cluster time-series data into 'sessions' and assign sessionId
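I have time-series data in the following format:
DataTable (Element:string, Timestamp:datetime, Value:long)
Each Element has timestamps and a Value associated with each timestamp. If two consecutive timestamps of an element are more than X minutes apart, they are considered part of different sessions (the smaller timestamp is the end of the previous session and the larger timestamp is the start of the new session). For each such session, I want to compute a SessionId (based on the session start, or a random guid), the session start, and the session end.
Example: (a timestamp with a gap of more than 30 minutes from the previous one is considered the start of a new session)
Input: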
Element Timestamp Value
Element-A 2022-03-25 06:15:00 10
Element-A 2022-03-25 06:30:00 10
Element-A 2022-03-25 06:45:00 10
Element-A 2022-03-25 08:15:00 10
Element-A 2022-03-25 08:30:00 10
Element-A 2022-03-25 08:45:00 10
Element-B 2022-03-25 07:15:00 10
Element-B 2022-03-25 07:30:00 10
Element-B 2022-03-25 07:45:00 10
Element-B 2022-03-25 09:15:00 10
Element-B 2022-03-25 09:30:00 10
Element-B 2022-03-25 09:45:00 10
Expected output:
Element Timestamp value SessionId SessionStart SessionEnd
Element-A 2022-03-25 06:15:00 10 guid-1 2022-03-25 06:15:00 2022-03-25 06:45:00
Element-A 2022-03-25 06:30:00 10 guid-1 2022-03-25 06:15:00 2022-03-25 06:45:00
Element-A 2022-03-25 06:45:00 10 guid-1 2022-03-25 06:15:00 2022-03-25 06:45:00
Element-A 2022-03-25 08:15:00 10 guid-2 2022-03-25 08:15:00 2022-03-25 08:45:00
Element-A 2022-03-25 08:30:00 10 guid-2 2022-03-25 08:15:00 2022-03-25 08:45:00
Element-A 2022-03-25 08:45:00 10 guid-2 2022-03-25 08:15:00 2022-03-25 08:45:00
Element-B 2022-03-25 07:15:00 10 guid-3 2022-03-25 07:15:00 2022-03-25 07:45:00
Element-B 2022-03-25 07:30:00 10 guid-3 2022-03-25 07:15:00 2022-03-25 07:45:00
Element-B 2022-03-25 07:45:00 10 guid-3 2022-03-25 07:15:00 2022-03-25 07:45:00
Element-B 2022-03-25 09:15:00 10 guid-4 2022-03-25 09:15:00 2022-03-25 09:45:00
Element-B 2022-03-25 09:30:00 10 guid-4 2022-03-25 09:15:00 2022-03-25 09:45:00
Element-B 2022-03-25 09:45:00 10 guid-4 2022-03-25 09:15:00 2022-03-25 09:45:00
The data volume is large. Please suggest a performance-efficient query to achieve this.
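Following the OP's comment, adding a solution that includes only the summarize part.
Please note -
- The combination of Element and SessionIndex is unique and can be used interchangeably with SessionId (which is based on new_guid())
- Since this solution is based on summarize, additional information can easily be collected per session, e.g. the number of events per session, the min/max/avg value per session, the number of events with a value above x (based on countif), etc.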
datatable (Element:string, Timestamp:datetime, Value:int)
[
"Element-A" ,"2022-03-25 06:15:00" ,10
,"Element-A" ,"2022-03-25 06:30:00" ,10
,"Element-A" ,"2022-03-25 06:45:00" ,10
,"Element-A" ,"2022-03-25 08:15:00" ,10
,"Element-A" ,"2022-03-25 08:30:00" ,10
,"Element-A" ,"2022-03-25 08:45:00" ,10
,"Element-B" ,"2022-03-25 07:15:00" ,10
,"Element-B" ,"2022-03-25 07:30:00" ,10
,"Element-B" ,"2022-03-25 07:45:00" ,10
,"Element-B" ,"2022-03-25 09:15:00" ,10
,"Element-B" ,"2022-03-25 09:30:00" ,10
,"Element-B" ,"2022-03-25 09:45:00" ,10
]
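| partition hint.strategy=shuffle by Element
(
    // Within each Element, sort by time so that prev() refers to the preceding event
    order by Timestamp asc
    // Increment the session index whenever the gap to the previous event exceeds 30 minutes
    | extend SessionIndex = row_cumsum(iff(Timestamp - prev(Timestamp) > 30m, 1, 0))
    // One row per session
    | summarize SessionStart = min(Timestamp), SessionEnd = max(Timestamp) by SessionIndex
    | extend Element, SessionId = new_guid()
    | project-reorder Element
)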
Element | SessionIndex | SessionStart | SessionEnd | SessionId |
---|---|---|---|---|
Element-A | 0 | 2022-03-25T06:15:00Z | 2022-03-25T06:45:00Z | 5d43e356-9aae-40cb-9e2e-bd2741cc9934 |
Element-B | 0 | 2022-03-25T07:15:00Z | 2022-03-25T07:45:00Z | df83db35-c292-4bee-a14e-0ebc2b7ef6b5 |
Element-A | 1 | 2022-03-25T08:15:00Z | 2022-03-25T08:45:00Z | 40dbaa02-b110-4e99-8696-2505a2995553 |
Element-B | 1 | 2022-03-25T09:15:00Z | 2022-03-25T09:45:00Z | 59d6fdeb-a596-4fab-97e5-d9057519c6c0 |
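As a rough sketch of the second note above, the same summarize can carry extra per-session aggregates. The column names Events, MinValue, MaxValue, AvgValue and EventsAboveThreshold, the shortened sample data with varying values, and the threshold of 15 are all assumptions made for illustration; they are not part of the original query.

```kusto
// Sketch only: shortened sample data with varying values (assumed for illustration)
datatable (Element:string, Timestamp:datetime, Value:int)
[
 "Element-A" ,"2022-03-25 06:15:00" ,10
,"Element-A" ,"2022-03-25 06:30:00" ,20
,"Element-A" ,"2022-03-25 08:15:00" ,30
,"Element-B" ,"2022-03-25 07:15:00" ,10
,"Element-B" ,"2022-03-25 09:15:00" ,40
]
| partition hint.strategy=shuffle by Element
(
    order by Timestamp asc
    | extend SessionIndex = row_cumsum(iff(Timestamp - prev(Timestamp) > 30m, 1, 0))
    | summarize
        SessionStart = min(Timestamp),
        SessionEnd = max(Timestamp),
        Events = count(),                           // number of events in the session
        MinValue = min(Value),
        MaxValue = max(Value),
        AvgValue = avg(Value),
        EventsAboveThreshold = countif(Value > 15)  // 15 is an arbitrary example threshold
        by SessionIndex
    | extend Element, SessionId = new_guid()
    | project-reorder Element
)
```

Because each added aggregate is computed in the same pass over the session's rows, this variant keeps the shape of the summarize-only query and only widens its output.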
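You could start with this.
The demographics of your data (number of records, number of elements, number of sessions per element) will determine how well this solution is optimized for your specific needs.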
datatable (Element:string, Timestamp:datetime, Value:int)
[
"Element-A" ,"2022-03-25 06:15:00" ,10
,"Element-A" ,"2022-03-25 06:30:00" ,10
,"Element-A" ,"2022-03-25 06:45:00" ,10
,"Element-A" ,"2022-03-25 08:15:00" ,10
,"Element-A" ,"2022-03-25 08:30:00" ,10
,"Element-A" ,"2022-03-25 08:45:00" ,10
,"Element-B" ,"2022-03-25 07:15:00" ,10
,"Element-B" ,"2022-03-25 07:30:00" ,10
,"Element-B" ,"2022-03-25 07:45:00" ,10
,"Element-B" ,"2022-03-25 09:15:00" ,10
,"Element-B" ,"2022-03-25 09:30:00" ,10
,"Element-B" ,"2022-03-25 09:45:00" ,10
]
| partition hint.strategy=shuffle by Element
(
    order by Timestamp asc
    | extend SessionIndex = row_cumsum(iff(Timestamp - prev(Timestamp) > 30m, 1, 0))
    // Collapse each session to a single row, keeping its events as lists
    | summarize min(Timestamp), max(Timestamp), make_list(Timestamp), make_list(Value) by SessionIndex
    // One guid per session, assigned before expanding back to events
    | extend SessionId = new_guid()
    // Expand the lists back to one row per event
    | mv-apply Timestamp = list_Timestamp to typeof(datetime), Value = list_Value to typeof(int) on (project Timestamp, Value)
    | project Element, Timestamp, Value, SessionStart = min_Timestamp, SessionEnd = max_Timestamp, SessionId, SessionIndex
)
Element | Timestamp | Value | SessionStart | SessionEnd | SessionId | SessionIndex |
---|---|---|---|---|---|---|
Element-A | 2022-03-25T06:15:00Z | 10 | 2022-03-25T06:15:00Z | 2022-03-25T06:45:00Z | 1ac146b1-24fa-427e-b2b3-663d83297d4c | 0 |
Element-A | 2022-03-25T06:30:00Z | 10 | 2022-03-25T06:15:00Z | 2022-03-25T06:45:00Z | 1ac146b1-24fa-427e-b2b3-663d83297d4c | 0 |
Element-A | 2022-03-25T06:45:00Z | 10 | 2022-03-25T06:15:00Z | 2022-03-25T06:45:00Z | 1ac146b1-24fa-427e-b2b3-663d83297d4c | 0 |
Element-B | 2022-03-25T07:15:00Z | 10 | 2022-03-25T07:15:00Z | 2022-03-25T07:45:00Z | cbef109a-73bc-4067-9e7f-ebada6aa444e | 0 |
Element-B | 2022-03-25T07:30:00Z | 10 | 2022-03-25T07:15:00Z | 2022-03-25T07:45:00Z | cbef109a-73bc-4067-9e7f-ebada6aa444e | 0 |
Element-B | 2022-03-25T07:45:00Z | 10 | 2022-03-25T07:15:00Z | 2022-03-25T07:45:00Z | cbef109a-73bc-4067-9e7f-ebada6aa444e | 0 |
Element-A | 2022-03-25T08:15:00Z | 10 | 2022-03-25T08:15:00Z | 2022-03-25T08:45:00Z | c53fba2e-b82e-418c-9380-1e732be8fcb5 | 1 |
Element-A | 2022-03-25T08:30:00Z | 10 | 2022-03-25T08:15:00Z | 2022-03-25T08:45:00Z | c53fba2e-b82e-418c-9380-1e732be8fcb5 | 1 |
Element-A | 2022-03-25T08:45:00Z | 10 | 2022-03-25T08:15:00Z | 2022-03-25T08:45:00Z | c53fba2e-b82e-418c-9380-1e732be8fcb5 | 1 |
Element-B | 2022-03-25T09:15:00Z | 10 | 2022-03-25T09:15:00Z | 2022-03-25T09:45:00Z | 4ab89211-4378-45d3-8ac7-a570942e2807 | 1 |
Element-B | 2022-03-25T09:30:00Z | 10 | 2022-03-25T09:15:00Z | 2022-03-25T09:45:00Z | 4ab89211-4378-45d3-8ac7-a570942e2807 | 1 |
Element-B | 2022-03-25T09:45:00Z | 10 | 2022-03-25T09:15:00Z | 2022-03-25T09:45:00Z | 4ab89211-4378-45d3-8ac7-a570942e2807 | 1 |
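Since the data demographics mentioned above (number of records, number of elements, number of sessions per element) drive the tuning, one possible way to measure them is sketched here. The name T, the shortened sample data, and the output column names Records and Sessions are assumptions for illustration; the sessionization step simply reuses the 30-minute rule from the query above.

```kusto
// Sketch only: T stands in for the real table (shortened sample data, assumed for illustration)
let T = datatable (Element:string, Timestamp:datetime, Value:int)
[
 "Element-A" ,"2022-03-25 06:15:00" ,10
,"Element-A" ,"2022-03-25 06:30:00" ,10
,"Element-A" ,"2022-03-25 08:15:00" ,10
,"Element-B" ,"2022-03-25 07:15:00" ,10
,"Element-B" ,"2022-03-25 09:15:00" ,10
];
T
| partition hint.strategy=shuffle by Element
(
    order by Timestamp asc
    | extend SessionIndex = row_cumsum(iff(Timestamp - prev(Timestamp) > 30m, 1, 0))
    // One row per session, carrying the number of events in it
    | summarize Records = count() by SessionIndex
    | project Element, SessionIndex, Records
)
// One row per element: total records and number of sessions
| summarize Records = sum(Records), Sessions = count() by Element
```

The number of rows in the result is the number of elements, and the Records and Sessions columns show how the load is spread across them.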