Compare array of datetime objects and pick all rows where difference between each and the next is less than 7 days

My table looks like this:

(I can't post images yet)

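I want to select all names in my table where the time difference between each datetime object and the next one is always more than 7 days. So from the above I would get only Paul, because Adam's first two datetimes are already only one day apart.

The best I could come up with is to take the time difference between the smallest and largest datetime in the array and divide it by array_length(datetime). That is basically the average time between all the datetime objects, but it doesn't help me.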
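To see why the average does not answer the question, here is a minimal Python sketch (not BigQuery code) that applies both measures to Adam's datetimes from the sample data used further down this page. The helper names `average_spacing` and `smallest_gap` are made up for illustration.

```python
from datetime import datetime, timedelta

# Adam's datetimes, copied from the sample data further down this page.
adam = [
    datetime(2018, 7, 26, 17, 55, 3),
    datetime(2018, 7, 27, 17, 55, 3),
    datetime(2018, 6, 29, 17, 55, 3),
    datetime(2018, 7, 16, 17, 55, 3),
    datetime(2018, 8, 19, 17, 55, 3),
    datetime(2018, 7, 14, 17, 55, 3),
]


def average_spacing(values):
    """(max - min) divided by the number of entries, as described above."""
    return (max(values) - min(values)) / len(values)


def smallest_gap(values):
    """Smallest difference between chronologically adjacent entries."""
    ordered = sorted(values)
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))


print(average_spacing(adam))  # 8 days, 12:00:00
print(smallest_gap(adam))     # 1 day, 0:00:00
```

The average comes out above 7 days even though two of Adam's entries are only one day apart, so the condition has to be checked on the smallest gap (or on every adjacent pair), not on the average.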

I'm using Standard SQL on BigQuery.

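-- Compares neighbours in array order, so this gives the intended result
-- only if each array is sorted newest first (descending).
SELECT name
FROM dataset.table
WHERE NOT EXISTS(
  SELECT 1 FROM UNNEST(datetime) AS dt WITH OFFSET off
  WHERE DATETIME_DIFF(
    -- previous array element minus current one; NULL (no match) when off = 0
    datetime[SAFE_OFFSET(off - 1)], dt, DAY
  ) <= 7
)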

This compares each entry in the array with the one after it, looking for any pair whose difference is 7 days or less.
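The same pairwise check can be sketched in Python (not BigQuery code) against the sample values that appear later on this page. Two things here are assumptions of the sketch: it sorts each array first, because array order is not guaranteed to be time order, and it uses `timedelta.days`, which counts whole 24-hour periods and may differ from `DATETIME_DIFF(..., DAY)` when the times of day differ. The function name `all_gaps_exceed` is hypothetical.

```python
from datetime import datetime


def all_gaps_exceed(values, days=7):
    """True when every chronologically adjacent pair is more than `days` apart."""
    ordered = sorted(values)  # assumption: sort first, array order may be arbitrary
    return all(
        (later - earlier).days > days
        for earlier, later in zip(ordered, ordered[1:])
    )


def parse(stamps):
    """Turn ISO 8601 strings into datetime objects."""
    return [datetime.fromisoformat(s) for s in stamps]


# Sample values copied from the dummy data further down this page.
table = {
    "Adam": parse([
        "2018-07-26T17:55:03", "2018-07-27T17:55:03", "2018-06-29T17:55:03",
        "2018-07-16T17:55:03", "2018-08-19T17:55:03", "2018-07-14T17:55:03",
    ]),
    "Paul": parse([
        "2018-08-26T17:55:03", "2018-08-18T17:55:03", "2018-06-20T17:55:03",
        "2018-08-09T17:55:03", "2018-07-16T17:55:03",
    ]),
}

selected = [name for name, values in table.items() if all_gaps_exceed(values)]
print(selected)  # ['Paul']
```

Like the `NOT EXISTS` form, this keeps a name whose array has zero or one entry, since there is no pair that could violate the condition.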

You can use unnest():

select t.*
from t
where not exists (select 1
                  from (select dt, lag(dt) over (order by dt) as prev_dt
                        from unnest(datetime) dt
                       ) x
                  where dt < datetime_add(prev_dt, interval 7 day)
                 );

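It's not clear what the exact schema of your data is. Based on the layout, datetime looks like an array, but based on the data types shown in your image it might be just a regular field, so both cases are covered below (for BigQuery Standard SQL).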

Case 1 - repeated field

#standardSQL
SELECT name
FROM `project.dataset.table`
WHERE 7 < (
    SELECT DATETIME_DIFF(
      datetime, 
      LAG(datetime) OVER(PARTITION BY name ORDER BY datetime), 
      DAY) distance
    FROM UNNEST(datetime) datetime 
    ORDER BY IFNULL(distance, 777)
    LIMIT 1
  ) 

You can test and play with it using the dummy data below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Adam' name, 
    [DATETIME '2018-07-26T17:55:03', 
      '2018-07-27T17:55:03',
      '2018-06-29T17:55:03',
      '2018-07-16T17:55:03',
      '2018-08-19T17:55:03',
      '2018-07-14T17:55:03'] datetime UNION ALL
  SELECT 'Paul', [DATETIME '2018-08-26T17:55:03',
      '2018-08-18T17:55:03',
      '2018-06-20T17:55:03',
      '2018-08-09T17:55:03',
      '2018-07-16T17:55:03']
)
SELECT name
FROM `project.dataset.table`
WHERE 7 < (
    SELECT DATETIME_DIFF(
      datetime, 
      LAG(datetime) OVER(PARTITION BY name ORDER BY datetime), 
      DAY) distance
    FROM UNNEST(datetime) datetime 
    ORDER BY IFNULL(distance, 777)
    LIMIT 1
  ) 

Case 2 - regular (not repeated) field

#standardSQL
SELECT name FROM (
  SELECT name, 
    DATETIME_DIFF(
      datetime, 
      LAG(datetime) OVER(PARTITION BY name ORDER BY datetime), 
      DAY
    ) distance
  FROM `project.dataset.table`
)
GROUP BY name 
HAVING MIN(distance) > 7

Example with dummy data below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Adam' name, DATETIME '2018-07-26T17:55:03' datetime UNION ALL
  SELECT 'Adam', '2018-07-27T17:55:03' UNION ALL
  SELECT 'Adam', '2018-06-29T17:55:03' UNION ALL
  SELECT 'Adam', '2018-07-16T17:55:03' UNION ALL
  SELECT 'Adam', '2018-08-19T17:55:03' UNION ALL
  SELECT 'Adam', '2018-07-14T17:55:03' UNION ALL
  SELECT 'Paul', '2018-08-26T17:55:03' UNION ALL
  SELECT 'Paul', '2018-08-18T17:55:03' UNION ALL
  SELECT 'Paul', '2018-06-20T17:55:03' UNION ALL
  SELECT 'Paul', '2018-08-09T17:55:03' UNION ALL
  SELECT 'Paul', '2018-07-16T17:55:03' 
)
SELECT name FROM (
  SELECT name, 
    DATETIME_DIFF(
      datetime, 
      LAG(datetime) OVER(PARTITION BY name ORDER BY datetime), 
      DAY
    ) distance
  FROM `project.dataset.table`
)
GROUP BY name 
HAVING MIN(distance) > 7   

Both return the same result:

Row name     
1   Paul