Query previous rows on a condition

I have a table of data about users' flight-booking patterns on a website. Assume the data below is all the historical data I have about my users.

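session_date is the date the user visited the website and searched for a particular route, and flight_date is the flight's departure date. I have ordered the table by session_date. Whether the search ended in a booking is recorded in booked.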

+---------+--------------+----------------+--------------+-------------+--------+
| user_id | session_date | departure_code | arrival_code | flight_date | booked |
+---------+--------------+----------------+--------------+-------------+--------+
| user1   | 7 Jan        | CA             | MY           | 8 Mar       |      1 |
| user1   | 8 Jan        | US             | MY           | 18 May      |      0 |
| user1   | 8 Jan        | US             | MY           | 18 May      |      1 |
| user1   | 8 Jan        | CA             | MY           | 19 Mar      |      0 |
| user1   | 9 Jan        | US             | MY           | 18 May      |      1 |
+---------+--------------+----------------+--------------+-------------+--------+

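I want to output a new column in my table called previous_flight_date. For each search, the new column should state the flight_date that was previously booked for that particular route. Even if the user has searched the same route several times, the value in this column should be null as long as they have never booked it.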


+---------+--------------+----------------+--------------+-------------+--------+----------------------+
| user_id | session_date | departure_code | arrival_code | flight_date | booked | previous_flight_date |
+---------+--------------+----------------+--------------+-------------+--------+----------------------+
| user1   | 7 Jan        | CA             | SG           | 8 Mar       |      1 | null                 |
| user1   | 8 Jan        | US             | MY           | 18 May      |      0 | null                 |
| user1   | 8 Jan        | US             | MY           | 18 May      |      1 | null                 |
| user1   | 8 Jan        | CA             | SG           | 19 Mar      |      0 | 8 Mar                |
| user1   | 2 Feb        | US             | MY           | 2 Jul       |      1 | 18 May               |
+---------+--------------+----------------+--------------+-------------+--------+----------------------+

So, for example, the column is null until row 4, where it shows "8 Mar", because by then the user had already booked a CA-->SG flight departing on that day.

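I have tried using LAST_VALUE without success. I am also not sure how to use LAG() when there are several different routes and I want to look up previous rows based on a condition. It would be great if someone could suggest a solution! Thanks.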

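I started off using LAG as you suggested, but then found it rather difficult to phrase the query. For an approach that does not use analytic functions, we can try using just a correlated subquery to identify the most recent booked flight date on the same route.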

SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
       (SELECT t2.flight_date FROM yourTable t2
        WHERE t2.departure_code = t1.departure_code AND
              t2.arrival_code = t1.arrival_code AND
              t2.booked = 1 AND
              t2.flight_date < t1.flight_date
        ORDER BY t2.flight_date DESC LIMIT 1) AS previous_flight_date
FROM yourTable t1
ORDER BY flight_date;
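
As a rough local check of the correlated-subquery idea, the sketch below loads the five sample rows (in the form used by the BigQuery answer further down) into an in-memory SQLite database through Python's sqlite3 module and runs essentially the same query. SQLite is only a stand-in here, not MariaDB or BigQuery. The ISO-formatted dates with the year 2020, the extra user_id match, the booked tie-breaker in the final ORDER BY, and the helper name previous_flight_dates are assumptions added for illustration.

```python
import sqlite3

# Sample rows; dates are ISO strings so that plain string comparison
# in SQLite agrees with chronological order (assumption: year 2020).
ROWS = [
    ("user1", "2020-01-07", "CA", "SG", "2020-03-08", 1),
    ("user1", "2020-01-08", "US", "MY", "2020-05-18", 0),
    ("user1", "2020-01-08", "US", "MY", "2020-05-18", 1),
    ("user1", "2020-01-08", "CA", "SG", "2020-03-19", 0),
    ("user1", "2020-02-09", "US", "MY", "2020-07-02", 1),
]

QUERY = """
SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
       (SELECT t2.flight_date FROM yourTable t2
        WHERE t2.user_id = t1.user_id AND          -- assumption: match per user
              t2.departure_code = t1.departure_code AND
              t2.arrival_code = t1.arrival_code AND
              t2.booked = 1 AND
              t2.flight_date < t1.flight_date
        ORDER BY t2.flight_date DESC LIMIT 1) AS previous_flight_date
FROM yourTable t1
ORDER BY flight_date, booked                       -- booked only makes the order stable
"""


def previous_flight_dates():
    """Run the correlated-subquery version and return all result rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE yourTable (user_id TEXT, session_date TEXT, "
        "departure_code TEXT, arrival_code TEXT, flight_date TEXT, booked INTEGER)"
    )
    conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in previous_flight_dates():
        print(row)
```

Because the subquery requires a strictly earlier flight_date, the two US to MY searches for 18 May both get NULL, even though one of them was booked.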

The demo was run on MariaDB, but the same query should run on BigQuery without any problems.

Below is for BigQuery Standard SQL

#standardSQL
SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
  MAX(IF(booked = 1, flight_date, NULL)) OVER(previous_flights) AS previous_flight_date
FROM `project.dataset.table` 
WINDOW previous_flights AS (
  PARTITION BY user_id, departure_code, arrival_code 
  ORDER BY flight_date 
  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)

If applied to the sample data from your question, as in the example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'user1' AS user_id, DATE '2020-01-07' AS session_date, 'CA' AS departure_code, 'SG' AS arrival_code, DATE '2020-03-08' AS flight_date, 1 AS booked UNION ALL
  SELECT 'user1', '2020-01-08', 'US', 'MY', '2020-05-18', 0 UNION ALL
  SELECT 'user1', '2020-01-08', 'US', 'MY', '2020-05-18', 1 UNION ALL
  SELECT 'user1', '2020-01-08', 'CA', 'SG', '2020-03-19', 0 UNION ALL
  SELECT 'user1', '2020-02-09', 'US', 'MY', '2020-07-02', 1
)
SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
  MAX(IF(booked = 1, flight_date, NULL)) OVER(previous_flights) AS previous_flight_date
FROM `project.dataset.table` 
WINDOW previous_flights AS (
  PARTITION BY user_id, departure_code, arrival_code 
  ORDER BY flight_date 
  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
-- ORDER BY flight_date

the output is

Row user_id session_date    departure_code  arrival_code    flight_date booked  previous_flight_date     
1   user1   2020-01-07      CA              SG              2020-03-08  1       null     
2   user1   2020-01-08      CA              SG              2020-03-19  0       2020-03-08   
3   user1   2020-01-08      US              MY              2020-05-18  0       null     
4   user1   2020-01-08      US              MY              2020-05-18  1       null     
5   user1   2020-02-09      US              MY              2020-07-02  1       2020-05-18   

Below is a SQL Server based solution using window functions. A BigQuery solution should be similar, since window functions are standard.

SELECT
    *
    -- latest booked flight_date among the earlier rows of the same user and route
    , Previous_Flight_Date = MAX(CASE WHEN booked = 1 THEN flight_date ELSE NULL END)
                             OVER (
                                    PARTITION BY user_id, departure_code, arrival_code
                                    ORDER BY flight_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                             )
FROM historicTable t
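
The windowed MAX idea can also be tried locally. The sketch below is an assumption-laden stand-in rather than SQL Server itself: it runs the same kind of query on SQLite (which needs version 3.25 or later for window functions) through Python's sqlite3 module, using the sample rows with ISO dates. The named window, the booked tie-breaker inside the window ORDER BY, and the helper name windowed_previous_flight_dates are illustrative choices, not part of the answer above.

```python
import sqlite3

# Sample rows; dates are ISO strings so that MAX and ORDER BY on text
# follow chronological order (assumption: year 2020).
ROWS = [
    ("user1", "2020-01-07", "CA", "SG", "2020-03-08", 1),
    ("user1", "2020-01-08", "US", "MY", "2020-05-18", 0),
    ("user1", "2020-01-08", "US", "MY", "2020-05-18", 1),
    ("user1", "2020-01-08", "CA", "SG", "2020-03-19", 0),
    ("user1", "2020-02-09", "US", "MY", "2020-07-02", 1),
]

QUERY = """
SELECT user_id, session_date, departure_code, arrival_code, flight_date, booked,
       MAX(CASE WHEN booked = 1 THEN flight_date END) OVER previous_flights
           AS previous_flight_date
FROM historicTable
WINDOW previous_flights AS (
    PARTITION BY user_id, departure_code, arrival_code
    -- assumption: booked breaks ties between rows with the same flight_date,
    -- so the ROWS frame sees them in a fixed order
    ORDER BY flight_date, booked
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
ORDER BY flight_date, booked
"""


def windowed_previous_flight_dates():
    """Run the window-function version and return all result rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE historicTable (user_id TEXT, session_date TEXT, "
        "departure_code TEXT, arrival_code TEXT, flight_date TEXT, booked INTEGER)"
    )
    conn.executemany("INSERT INTO historicTable VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in windowed_previous_flight_dates():
        print(row)
```

One design point worth noting: a ROWS frame counts physical rows, so when two rows in a partition share the same flight_date, the result depends on how those ties are ordered. Without a tie-breaker, the unbooked 18 May search could end up after the booked one and pick up 18 May as its previous flight date.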

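I think you can do this with first_value(). The trick is to put a condition inside the window function, turn on the ignore nulls option, and then use a window frame specification that covers the earlier flights with the same (departure_code, arrival_code) tuple, not including the current row: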

select
    t.*,
    first_value(case when booked = 1 then flight_date end ignore nulls) over(
        partition by departure_code, arrival_code
        order by flight_date desc
        -- with descending order, the earlier flights are the following rows
        rows between 1 following and unbounded following
    ) previous_flight_date
from mytable t

Actually, a window max() would do as well (and then there is no need for ignore nulls):

select
    t.*,
    max(case when booked = 1 then flight_date end) over(
        partition by departure_code, arrival_code
        -- ascending order, so the preceding rows are the earlier flights
        order by flight_date
        rows between unbounded preceding and 1 preceding
    ) previous_flight_date
from mytable t