Assign a flag if a successful retry occurs within 1 second

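I am trying to assign a flag to each entry in my data based on whether a successful retry occurred within 1 second. Here is some sample data: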

date       id  event            url  code  event_ts
2021-08-20 11  1629515037.0682  xyz  503   2021-08-20 20:03:57.068
2021-08-20 11  1629515037.1073  xyz  200   2021-08-20 20:03:57.107  -- successful retry within 1 sec
2021-08-20 12  1629515037.1866  abc  503   2021-08-20 20:03:57.187
2021-08-20 12  1629515037.1942  abc  503   2021-08-20 20:03:57.194
2021-08-20 12  1629515037.2037  abc  503   2021-08-20 20:03:57.204
2021-08-20 12  1629515037.2249  abc  503   2021-08-20 20:03:57.225
2021-08-20 12  1629515064.2427  abc  200   2021-08-20 20:04:24.243  -- successful retry after 1 sec

I want to create a new column, retry, with the following values:

if code = 503, successful retry within 1 sec -> successful_retry
if code = 503, successful retry after 1 sec -> successful_retry_after_1_sec
if code = 503, no successful retry at all -> no_successful_retry

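I am mainly a Python/Pandas person, but I need to solve this right away. I tried using LEAD() but could not write a solution with a variable offset. Thanks for any pointers.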

Edit: based on @Gordon Linoff's answer

SELECT
    date,
    id,
    url,
    event,
    FROM_UNIXTIME(event) AS event_ts,
    code,
    (
        CASE
            WHEN code <= 399 THEN 'successful_response'
            WHEN MIN(CASE WHEN code <= 399 THEN FROM_UNIXTIME(event) END) OVER (
                PARTITION BY
                    date,
                    id,
                    url
                ORDER BY
                    date,
                    id,
                    url,
                    event
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
            ) <= FROM_UNIXTIME(event) + INTERVAL '1' SECOND THEN 'success_retry_within_1_sec'
            WHEN MIN(CASE WHEN code <= 399 THEN FROM_UNIXTIME(event) END) OVER (
                PARTITION BY
                    date,
                    id,
                    url
                ORDER BY
                    date,
                    id,
                    url,
                    event
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
            ) > FROM_UNIXTIME(event) + INTERVAL '1' SECOND THEN 'success_retry_after_1_sec'
            ELSE 'No_successful_retry'
        END
    ) AS successful_retry_flag
FROM t

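If I assume that a 200 is a successful retry, you can use a cumulative minimum to get the next successful retry. The rest is just date arithmetic to set the flag: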

select t.*,
       (case when min(case when code = 200 then event_ts end) over
                      (partition by id
                       order by event_ts
                       rows between current row and unbounded following
                      ) < event_ts + interval '1' second
             then 1 else 0
        end) as successful_retry_flag
from t;
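
Since the question mentions a Python background, the same idea can be sketched in plain Python as a cross-check. This is only an illustration, not part of the query above: the names flag_retries, ROWS and window are hypothetical, it assumes 200 is the only successful code and treats every other code as a failure, and it uses the raw epoch values from the event column. Walking each id's rows from newest to oldest plays the role of MIN(...) over CURRENT ROW to UNBOUNDED FOLLOWING.

```python
from collections import defaultdict

# Sample rows from the question: (id, event as epoch seconds, code).
ROWS = [
    (11, 1629515037.0682, 503),
    (11, 1629515037.1073, 200),
    (12, 1629515037.1866, 503),
    (12, 1629515037.1942, 503),
    (12, 1629515037.2037, 503),
    (12, 1629515037.2249, 503),
    (12, 1629515064.2427, 200),
]


def flag_retries(rows, window=1.0):
    """Label every row by the time to the next code-200 row with the same id."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row[0]].append(row)

    flags = {}
    for group in by_id.values():
        next_success = None  # earliest success at or after the current row
        # Newest first, so next_success always holds the nearest later success.
        for row_id, event, code in sorted(group, key=lambda r: r[1], reverse=True):
            if code == 200:
                next_success = event
                flag = "successful_response"
            elif next_success is None:
                flag = "no_successful_retry"
            elif next_success - event < window:
                flag = "successful_retry"
            else:
                flag = "successful_retry_after_1_sec"
            flags[(row_id, event)] = flag
    return flags


if __name__ == "__main__":
    for key, flag in sorted(flag_retries(ROWS).items()):
        print(key, flag)
```

On the sample data, the single failure for id 11 is followed by a success about 0.04 seconds later, while the failures for id 12 wait roughly 27 seconds, so they fall into the "after 1 sec" bucket.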

You can also use the more readable and extensible MATCH_RECOGNIZE solution, which was added recently to Trino (formerly PrestoSQL).

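With the MATCH_RECOGNIZE solution, you define the labels success and failure based on the value of code in the row currently being scanned. You can also use the MEASURES clause to define a measure, which we can call time_to_success, that gives the time from the current failure row's timestamp to the timestamp of the LAST success row in the match. With these values defined through pattern matching, you can then branch on them with a CASE expression, much as in @Gordon Linoff's solution.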

trino> WITH t(date, id, event, url, code, event_ts) AS (VALUES
         ->     (DATE '2021-08-20', 11, 1629515037.0682, 'xyz', 503, TIMESTAMP '2021-08-20 20:03:57.068'),
         ->     (DATE '2021-08-20', 11, 1629515037.1073, 'xyz', 200, TIMESTAMP '2021-08-20 20:03:57.107'),
         ->     (DATE '2021-08-20', 12, 1629515037.1866, 'abc', 503, TIMESTAMP '2021-08-20 20:03:57.187'),
         ->     (DATE '2021-08-20', 12, 1629515037.1942, 'abc', 503, TIMESTAMP '2021-08-20 20:03:57.194'),
         ->     (DATE '2021-08-20', 12, 1629515037.2037, 'abc', 503, TIMESTAMP '2021-08-20 20:03:57.204'),
         ->     (DATE '2021-08-20', 12, 1629515037.2249, 'abc', 503, TIMESTAMP '2021-08-20 20:03:57.225'),
         ->     (DATE '2021-08-20', 12, 1629515064.2427, 'abc', 200, TIMESTAMP '2021-08-20 20:04:24.243')
         ->     )
         -> SELECT date, id, event, url, code, event_ts,
         ->        CASE
         ->            WHEN code = 200 THEN 'successful_response'
         ->            WHEN time_to_success < INTERVAL '1' SECOND THEN 'successful_retry'
         ->            WHEN time_to_success >= INTERVAL '1' SECOND THEN 'successful_retry_after_1_sec'
         ->            WHEN time_to_success IS NULL THEN 'no_successful_retry'
         ->        END flag
         -> FROM t
         -> MATCH_RECOGNIZE (
         ->     PARTITION BY id
         ->     ORDER BY event_ts
         ->     MEASURES FINAL LAST(success.event_ts) - failure.event_ts AS time_to_success
         ->     ALL ROWS PER MATCH WITH UNMATCHED ROWS
         ->     PATTERN (success* failure+ success)
         ->     DEFINE
         ->            success AS code = 200,
         ->            failure AS code = 503
         -> );
    date    | id |      event      | url | code |        event_ts         |             flag
------------+----+-----------------+-----+------+-------------------------+------------------------------
 2021-08-20 | 12 | 1629515037.1866 | abc |  503 | 2021-08-20 20:03:57.187 | successful_retry_after_1_sec
 2021-08-20 | 12 | 1629515037.1942 | abc |  503 | 2021-08-20 20:03:57.194 | successful_retry_after_1_sec
 2021-08-20 | 12 | 1629515037.2037 | abc |  503 | 2021-08-20 20:03:57.204 | successful_retry_after_1_sec
 2021-08-20 | 12 | 1629515037.2249 | abc |  503 | 2021-08-20 20:03:57.225 | successful_retry_after_1_sec
 2021-08-20 | 12 | 1629515064.2427 | abc |  200 | 2021-08-20 20:04:24.243 | successful_response
 2021-08-20 | 11 | 1629515037.0682 | xyz |  503 | 2021-08-20 20:03:57.068 | successful_retry
 2021-08-20 | 11 | 1629515037.1073 | xyz |  200 | 2021-08-20 20:03:57.107 | successful_response
(7 rows)