Querying in BigQuery?

I have a package table in BigQuery that looks like this:

 Packageid  Scanid  dispatchid  timestamp   status
   p1         s1       null        t1        'in'
   p2         s1       xxx         t2        'in'
   p1         s2       yyy         t3        'pkin'
   p1         s3       sss         t4        'iwi'
   p1         s4       eee         t5        'lhp'
   p2         s2       uuuu        t6        'uio'
   p2         s3       null        t7        'jsk'

I want to retrieve the following details:

Packageid   Latest-Scanid   First-Dispatch-time  Last-Dispatch-time   latest-status

 p1            s4                 t3                 t5                 'lhp'
 p2            s3                 t2                 t6                 'jsk'  

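First-Dispatch-time is the time of the first scan of the package that has a dispatch id. Last-Dispatch-time is the time of the last scan of the package that has a dispatch id.

Is there any way to get the table above using BigQuery, or using a function defined in BigQuery?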

One method uses window functions and conditional aggregation:

select packageid,
       -- seqnum = 1 marks each package's most recent scan
       max(case when seqnum = 1 then scanid end) as latest_scanid,
       -- only rows that carry a dispatch id count toward the dispatch times
       min(case when dispatchid is not null then timestamp end) as first_dispatch_time,
       max(case when dispatchid is not null then timestamp end) as last_dispatch_time,
       max(case when seqnum = 1 then status end) as latest_status
from (select t.*,
             row_number() over (partition by packageid order by timestamp desc) as seqnum
      from t
     ) t
group by packageid;
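
To check the conditional-aggregation approach against the sample rows in the question, here is a self-contained sketch in BigQuery standard SQL. The placeholders t1..t7 are replaced by the integers 1..7 and the time column is called ts; both are assumptions made only so the example runs on its own. It picks scanid (not dispatchid) from the latest row so that the output matches the Latest-Scanid column requested.

```sql
-- Sample data from the question. Assumption: t1..t7 become 1..7 so ordering is defined,
-- and the time column is named ts instead of timestamp.
WITH t AS (
  SELECT 'p1' AS packageid, 's1' AS scanid, CAST(NULL AS STRING) AS dispatchid,
         1 AS ts, 'in' AS status UNION ALL
  SELECT 'p2', 's1', 'xxx',  2, 'in'   UNION ALL
  SELECT 'p1', 's2', 'yyy',  3, 'pkin' UNION ALL
  SELECT 'p1', 's3', 'sss',  4, 'iwi'  UNION ALL
  SELECT 'p1', 's4', 'eee',  5, 'lhp'  UNION ALL
  SELECT 'p2', 's2', 'uuuu', 6, 'uio'  UNION ALL
  SELECT 'p2', 's3', NULL,   7, 'jsk'
)
SELECT packageid,
       -- seqnum = 1 is the most recent scan of each package
       MAX(CASE WHEN seqnum = 1 THEN scanid END) AS latest_scanid,
       -- scans without a dispatch id are ignored by MIN/MAX because the CASE yields NULL
       MIN(CASE WHEN dispatchid IS NOT NULL THEN ts END) AS first_dispatch_time,
       MAX(CASE WHEN dispatchid IS NOT NULL THEN ts END) AS last_dispatch_time,
       MAX(CASE WHEN seqnum = 1 THEN status END) AS latest_status
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY packageid ORDER BY ts DESC) AS seqnum
      FROM t
     ) AS ranked
GROUP BY packageid
ORDER BY packageid;
```

With these stand-in values, p1 should come back as s4, 3, 5, 'lhp' and p2 as s3, 2, 6, 'jsk', which lines up with the desired table (t3/t5 and t2/t6).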

I'll note that this is written for SQL Server; it may or may not work in MySQL.

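SELECT Packageid, 
    MAX(Scanid) [Latest_Scanid],  -- assumes Scanid values sort in scan order
    MIN(CASE WHEN dispatchid IS NOT NULL THEN timestamp END) [First-Dispatch-time], 
    MAX(CASE WHEN dispatchid IS NOT NULL THEN timestamp END) [Last-Dispatch-time],
    -- status of the most recent scan for this package
    (SELECT TOP 1 p.status FROM Package p WHERE p.Packageid = Package.Packageid ORDER BY p.timestamp DESC) [latest-status]
FROM Package
GROUP BY Packageid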

The query below uses a "dirty" trick (see not_null_ts) that makes it possible to drop the outer GROUP BY and compute everything in the inner SELECT instead:

SELECT packageid, latest_scanid, first_dispatch_time, last_dispatch_time, latest_status
FROM (
  SELECT packageid, 
    IF(dispatchid IS NULL, NULL, ts) AS not_null_ts,
    FIRST_VALUE(scanid) OVER(PARTITION BY packageid ORDER BY ts DESC) AS latest_scanid,
    MIN(not_null_ts) OVER(PARTITION BY packageid) AS first_dispatch_time,
    MAX(not_null_ts) OVER(PARTITION BY packageid) AS last_dispatch_time,
    FIRST_VALUE(status) OVER(PARTITION BY packageid ORDER BY ts DESC) AS latest_status,
    ROW_NUMBER() OVER(PARTITION BY packageid ORDER BY not_null_ts DESC) AS line
  FROM YourTable 
)
WHERE line = 1

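I found a while ago that this trick works for me, but I don't think I have ever seen it explicitly documented. Maybe it is simply an obvious usage; I never gave it much thought.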
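
In dialects where a window function cannot refer to an alias defined in the same SELECT list (as far as I know, BigQuery standard SQL is one of them), the same single-level shape can be sketched by inlining the expression behind not_null_ts. This is a variant under that assumption, not the query above; the YourTable contents are the question's sample rows with t1..t7 replaced by the integers 1..7 for illustration.

```sql
-- Assumption: sample rows from the question, with t1..t7 replaced by 1..7.
WITH YourTable AS (
  SELECT 'p1' AS packageid, 's1' AS scanid, CAST(NULL AS STRING) AS dispatchid,
         1 AS ts, 'in' AS status UNION ALL
  SELECT 'p2', 's1', 'xxx',  2, 'in'   UNION ALL
  SELECT 'p1', 's2', 'yyy',  3, 'pkin' UNION ALL
  SELECT 'p1', 's3', 'sss',  4, 'iwi'  UNION ALL
  SELECT 'p1', 's4', 'eee',  5, 'lhp'  UNION ALL
  SELECT 'p2', 's2', 'uuuu', 6, 'uio'  UNION ALL
  SELECT 'p2', 's3', NULL,   7, 'jsk'
)
SELECT packageid, latest_scanid, first_dispatch_time, last_dispatch_time, latest_status
FROM (
  SELECT packageid,
    FIRST_VALUE(scanid) OVER (PARTITION BY packageid ORDER BY ts DESC) AS latest_scanid,
    -- IF(...) is written out in place instead of referencing the not_null_ts alias
    MIN(IF(dispatchid IS NULL, NULL, ts)) OVER (PARTITION BY packageid) AS first_dispatch_time,
    MAX(IF(dispatchid IS NULL, NULL, ts)) OVER (PARTITION BY packageid) AS last_dispatch_time,
    FIRST_VALUE(status) OVER (PARTITION BY packageid ORDER BY ts DESC) AS latest_status,
    -- every row of a package carries the same computed values, so keeping any one row is enough
    ROW_NUMBER() OVER (PARTITION BY packageid ORDER BY ts DESC) AS line
  FROM YourTable
)
WHERE line = 1
ORDER BY packageid;
```

The trade-off is that the IF expression appears twice, but the query no longer depends on alias visibility inside the SELECT list.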