Azure 流分析处理事件的时间太长

Question

我正在尝试配置 Azure 流分析作业，但性能一直很差。我从将数据推送到事件中心的客户端系统接收到数据。 ASA 将其查询到 Azure SQL 数据库中。

几天前，我注意到它生成了大量的 InputEventLateBeyondThreshold 错误。这是 ASA 的示例。 Timestamp 元素由客户端系统设置。

{
    "Tag": "MANG_POWER_5",
    "Value": 1.08411181,
    "ValueType": "Analogue",
    "Timestamp": "2022-02-01T09:00:00.0000000Z",
    "EventProcessedUtcTime": "2022-02-01T09:36:05.1482308Z",
    "PartitionId": 0,
    "EventEnqueuedUtcTime": "2022-02-01T09:00:00.8980000Z"
}

您可以看到事件很快到达，但处理时间超过 30 分钟。为了尽量避免 InputEventLateBeyondThreshold 错误，我增加了延迟事件阈值。这可能会导致处理时间增加，但它太短也会增加 InputEventLateBeyondThreshold 错误的数量。

水印延迟一直很高，但 SU 使用率约为 5%。对于此查询，我已将 SU 增加到尽可能高的水平。

我想弄明白，为什么在事件到达后处理事件需要这么长时间。

这是我正在使用的查询：

WITH PIDataSet AS (SELECT * FROM [<event-hub>] TIMESTAMP BY timestamp)

--Write data to SQL joining with a lookup
SELECT   
    i.Timestamp as timestamp,
    i.Value as value,
INTO [<sql-database>]
FROM PIDataSet as i
INNER JOIN [tagmapping-ref-alias] tm ON tm.sourcename = i.Tag

----Write data to AzureTable joining with a lookup
SELECT
    DATEDIFF(second,CAST('1970-01-01' as DateTime), I1.Timestamp) As Rowkey,
    I2.TagId as PartitionKey,
    I1.Value as Value,
    UDF.formatTime(I1.Timestamp) as DeviceTimeStamp
    into [<azure-table>]
FROM PIDataSet as I1
    JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag

--Get an hourly count into a SQL Table.
SELECT
    I2.TagId,
    System.Timestamp() as WindowEndTime, COUNT(I2.TagId) AS InputCount
    into [tagmeta-ref-alias]
FROM PIDataSet as I1
    JOIN [tagmapping-ref-alias] as I2 on I2.Sourcename = I1.Tag
    GROUP BY I2.TagId, TumblingWindow(Duration(hour, 1))

Answer 1

当您设置 59 分钟 out-of-order window 时，您所做的就是为该输入设置 59 分钟的缓冲区。当记录落入该缓冲区时，它们将等待 59 分钟，直到它们出来。作为交换，我们有机会 re-order 这些事件，以便他们按顺序查找工作。

在 1 小时使用它是一种极端设置，根据定义，它会自动为您提供 59 分钟的水印。这非常令人惊讶，我想知道为什么您需要这么高的值。

编辑

现在查看迟到政策。

您正在使用活动时间 (TIMESTAMP BY timestamp)，这意味着您的活动现在可以延迟，请参阅 this doc and this one。

这意味着当记录晚于 1 小时（因此时间戳比我们服务器上挂钟的时间早于 1 小时 - 在 UTC 中），然后我们将其时间戳调整为我们的挂钟减 1 小时，然后发送它到查询。这也意味着您的翻滚 window 总是需要再等一个小时，以确保它不会丢失那些迟到的记录。

这里我会做的是恢复默认设置（没有乱序，5秒延迟事件，调整事件）。然后当你得到 InputEventLateBeyondThreshold 时，这意味着该作业收到了过去晚于 5 秒的 timestamp。您不会丢失数据，我们正在将其 system.timestamp 调整为更新的值（但不是时间戳字段，我们不会更改它）。

然后我们需要了解的是，为什么您的管道中的一条记录从生产到消费需要超过 5 秒的时间。是因为你的摄取管道有很大的延迟，还是因为你的生产者时钟有时间偏差？你知道吗？

Azure 流分析处理事件的时间太长

Azure Stream Analytics takes too long to process the events

azure-stream-analytics