How to summarize data with arg_max() in KQL using two columns?

Question

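I have a table with the following columns:

ID: identifies the imported document (think of a file name). It is unique within a given combination of ImportId and ImportTime.
SomeData: a data column (shown as Value in the sample below). The real table has many more columns.
ImportId: an ID in the format YYYY-MM-DD, e.g. "2022-05-14" (this is a string column)
ImportTime: the date and time at which the import finished (this is a string column)

RowNum is not part of the table; it is only used here so that I can refer to individual records/rows.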

RowNum	ID	Value	ImportId	ImportTime
1	A	Doc A content as of May 11, 2022	2022-05-11	2022-05-11 13:00
2	B	Doc B content as of May 11, 2022	2022-05-11	2022-05-11 13:00
3	A	Doc A content as of May 11, 2022	2022-05-11	2022-05-11 17:00
4	B	Doc B content as of May 11, 2022	2022-05-11	2022-05-11 17:00
5	A	Doc A content as of May 14, 2022	2022-05-14	2022-05-17 08:00
6	B	Doc B content as of May 14, 2022	2022-05-14	2022-05-17 08:00
7	A	Doc A content as of May 14, 2022	2022-05-14	2022-05-17 10:00
8	B	Doc B content as of May 14, 2022	2022-05-14	2022-05-17 10:00
9	A	Doc A content as of May 11, 2022	2022-05-11	2022-05-18 15:00
10	B	Doc B content as of May 11, 2022	2022-05-11	2022-05-18 15:00

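In the table above, there are three imports for May 11 (ImportId = "2022-05-11") and two imports for May 14 (ImportId = "2022-05-14").
The most recent import run (ImportTime) was at 2022-05-18 15:00.
The latest import time does not necessarily correspond to the latest import data. In my example above, someone ran an import on May 18 at 15:00, but imported the catalog state as of May 11 (ImportId = "2022-05-11").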

The challenge: I need the rows with the latest ImportId (which is "2022-05-14") and, within that ImportId, the latest ImportTime (which is "2022-05-17 10:00").

For the example above, the result should contain the two rows with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00" (row numbers 7 and 8).

What I tried:

Approach 1

I used arg_max() on ImportTime:

T
| summarize arg_max(ImportTime, *) by ID

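This returns the last two rows (9 and 10), where ImportId is "2022-05-11". That is not what I want, because the latest ImportId is "2022-05-14".

Approach 2

If I use arg_max(ImportId, *) by ID instead, I get rows for "2022-05-14" (rows 5 and 6), but not the ones with the latest ImportTime.

Approach 3

I combined ImportTime and ImportId into one extended column and applied arg_max() to it. This seems to work, but I'm not sure whether it is correct in all cases.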

T
| extend Combined = strcat(ImportId, ImportTime)
| summarize arg_max(Combined, *) by ID

This returns the expected rows 7 and 8, with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00".

Is there a better option?

Answer 1

Take a look at the top-nested operator:

datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
    "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
]
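// Level 1: keep every Value; max(1) is only a placeholder aggregation (named "ignore").
| top-nested of Value by ignore=max(1),
// Level 2: within each Value, keep the single latest ImportId.
  top-nested 1 of ImportId by max(ImportId),
// Level 3: within that ImportId, keep the single latest ImportTime.
  top-nested 1 of ImportTime by max(ImportTime)
| project Value, ImportId, ImportTime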

Value	ImportId	ImportTime
A	2022-05-14 00:00:00.0000000	2022-05-17 10:00:00.0000000
B	2022-05-14 00:00:00.0000000	2022-05-17 10:00:00.0000000

Answer 2

You can also try this approach, using the partition operator with hint.strategy = native:

datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
    "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
]
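// Outer partition: one sub-query per Value.
| partition hint.strategy = native by Value
(
    // Inner partition: for each ImportId, keep the row with the latest ImportTime.
    partition hint.strategy = native by ImportId
    (
        top 1 by ImportTime
    )
    // Of those per-ImportId rows, keep the one with the latest ImportId.
    | top 1 by ImportId
)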
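
A further option, not taken from either answer, is to chain two arg_max() steps. On whether Approach 3 is always correct: strcat() yields a string, so arg_max() compares it lexicographically. That matches chronological order only while both columns keep a fixed-width, zero-padded format. The sketch below avoids the concatenation. It assumes the string-typed schema described in the question, with the ID and SomeData column names taken from the column list; the datatable literal is only a stand-in for the real table T.

```kql
// Stand-in for the real table T; ImportId and ImportTime are strings, as in the question.
datatable(ID:string, SomeData:string, ImportId:string, ImportTime:string)
[
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-11 13:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-11 13:00",
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-11 17:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-11 17:00",
    "A", "Doc A content as of May 14, 2022", "2022-05-14", "2022-05-17 08:00",
    "B", "Doc B content as of May 14, 2022", "2022-05-14", "2022-05-17 08:00",
    "A", "Doc A content as of May 14, 2022", "2022-05-14", "2022-05-17 10:00",
    "B", "Doc B content as of May 14, 2022", "2022-05-14", "2022-05-17 10:00",
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-18 15:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-18 15:00"
]
// Step 1: for each (ID, ImportId) pair, keep the row with the latest ImportTime.
| summarize arg_max(ImportTime, *) by ID, ImportId
// Step 2: for each ID, keep the remaining row with the latest ImportId.
| summarize arg_max(ImportId, *) by ID
```

On the sample data this should leave one row per ID, with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00" (rows 7 and 8). The order of the two steps matters: swapping them would pick the latest ImportId first and then an arbitrary row among its ties.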
