Find preceding row / entry under specific conditions in SQL / Redshift

Question

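I am trying to find the row that precedes a specific event in my database, or more precisely some of the data in that row.

In this example, I want to find the movement_method of the row immediately before a user's pub visit (ordered by timestamp). In Tom's case, I want to know that Tom drove home by car before he went to the pub. (What matters is not how he got to the pub, but the method he used just before going there.)

I have a sample table with the columns user, location, movement_method, timestamp: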

user	location	movement_method	timestamp
tom	work	car	2022-03-02 14:30
tom	home	car	2022-03-02 20:30
tom	pub	bus	2022-03-02 22:30
tom	home	foot	2022-03-03 02:30
jane	school	bus	2022-03-02 08:30
jane	home	bus	2022-03-02 14:30
jane	pub	foot	2022-03-02 21:30
jane	home	bus	2022-03-02 23:30
lila	work	bus	2022-03-02 08:30
lila	home	bus	2022-03-02 16:30
jake	friend	car	2022-03-02 15:30
jake	home	bus	2022-03-02 20:30
jake	pub	car	2022-03-02 20:30
jake	home	car	2022-03-03 02:30

For this table, the result I want is:

| user | preceding_movement_method |
| ---- | ------ |
| tom | car |
| jane | bus |
| jake | bus |

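lila is not reported because she never went to the pub
only the movement_method of the row preceding the pub visit is needed (ordered by timestamp)
the movement_method used to get to the pub is not relevant

My current approach is to build "preceding_movement_method" with a partition or window function, but I have not been able to pick out the "preceding" entry for the rows that match the where clause.

So I am looking for something like this pseudocode: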

select user,
 (select preceding movement_method 
  from movement_database 
  where location = 'pub'
  order by timestamp) as preceding_movement_method
from movement_database

Answer 1

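Well, I'm not sure whether it is a typo, but user jake has the same timestamp for home and for the pub, which is an unlikely event. The code may look a bit complicated, but it does take that case into account: it picks the latest non-pub row at or before the pub visit.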

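-- Backticks are MySQL identifier quoting; the table is assumed to be named movement.
select t1.`user`, t1.movement_method
from movement t1
join (
    -- latest non-pub timestamp at or before each user's pub visit
    select m.`user`, max(m.`timestamp`) as mx
    from movement m
    join (select `user`, `timestamp` from movement where location = 'pub') t
      on m.`user` = t.`user`
    where m.`timestamp` <= t.`timestamp` and m.location != 'pub'
    group by m.`user`  -- qualified, since both m and t expose a `user` column
) t2
  on t1.`user` = t2.`user` and t1.`timestamp` = t2.mx and t1.location != 'pub';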

Answer 2

The LAG() window function is where I would go with this. I set up the data (sqlfiddle) as:

create table movements (
  uname varchar(16),
  location  varchar(16),
  movement_method   varchar(16),
  ts timestamp
);
 
insert into movements values
('tom', 'work', 'car', '2022-03-02 14:30'),
('tom', 'home', 'car', '2022-03-02 20:30'),
('tom', 'pub', 'bus', '2022-03-02 22:30'),
('tom', 'home', 'foot', '2022-03-03 02:30'),
('jane', 'school', 'bus', '2022-03-02 08:30'),
('jane', 'home', 'bus', '2022-03-02 14:30'),
('jane', 'pub', 'foot', '2022-03-02 21:30'),
('jane', 'home', 'bus', '2022-03-02 23:30'),
('lila', 'work', 'bus', '2022-03-02 08:30'),
('lila', 'home', 'bus', '2022-03-02 16:30'),
('jake', 'friend', 'car', '2022-03-02 15:30'),
('jake', 'home', 'bus', '2022-03-02 20:30'),
('jake', 'pub', 'car', '2022-03-02 20:30'),
('jake', 'home', 'car', '2022-03-03 02:30');

And the SQL as:

select uname, pmove 
from (
  select uname, location,
    lag (movement_method) over (partition by uname order by ts) as pmove
  from movements) as subq
where location = 'pub';

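Now, Jake's home and pub rows share the same timestamp (2022-03-02 20:30), so there is some uncertainty: ordering by ts alone does not determine which of the two rows comes first, and therefore which row LAG() treats as the preceding one.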
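One possible way to remove that uncertainty is to add a tie-breaker to the window ordering so that, at equal timestamps, the pub row sorts last. The sketch below is only an illustration under stated assumptions: it runs the query against an in-memory SQLite database through Python's sqlite3 module as a stand-in for Redshift (SQLite 3.25 or newer is needed for window functions), and the `case` expression in the `order by` is my addition, not part of the query above. The names `ROWS`, `QUERY`, and `preceding_methods` are hypothetical.

```python
import sqlite3

# Sample data from the question (uname, location, movement_method, ts).
ROWS = [
    ("tom", "work", "car", "2022-03-02 14:30"),
    ("tom", "home", "car", "2022-03-02 20:30"),
    ("tom", "pub", "bus", "2022-03-02 22:30"),
    ("tom", "home", "foot", "2022-03-03 02:30"),
    ("jane", "school", "bus", "2022-03-02 08:30"),
    ("jane", "home", "bus", "2022-03-02 14:30"),
    ("jane", "pub", "foot", "2022-03-02 21:30"),
    ("jane", "home", "bus", "2022-03-02 23:30"),
    ("lila", "work", "bus", "2022-03-02 08:30"),
    ("lila", "home", "bus", "2022-03-02 16:30"),
    ("jake", "friend", "car", "2022-03-02 15:30"),
    ("jake", "home", "bus", "2022-03-02 20:30"),
    ("jake", "pub", "car", "2022-03-02 20:30"),
    ("jake", "home", "car", "2022-03-03 02:30"),
]

# Same LAG() query as above, plus an assumed tie-breaker:
# when two rows share a ts, the pub row is ordered after the other one.
QUERY = """
select uname, pmove
from (
  select uname, location,
    lag(movement_method) over (
      partition by uname
      order by ts, case when location = 'pub' then 1 else 0 end
    ) as pmove
  from movements) as subq
where location = 'pub'
order by uname;
"""


def preceding_methods(rows):
    """Load rows into an in-memory table and return (uname, pmove) pairs."""
    conn = sqlite3.connect(":memory:")
    # ts is stored as text; 'YYYY-MM-DD HH:MM' strings sort chronologically.
    conn.execute(
        "create table movements "
        "(uname text, location text, movement_method text, ts text)"
    )
    conn.executemany("insert into movements values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for uname, pmove in preceding_methods(ROWS):
        print(uname, pmove)
```

With the sample data this prints jake bus, jane bus, tom car, which matches the result asked for in the question. Whether "pub sorts last on a tie" is the right rule depends on what the data actually means; if the duplicate timestamp is a data-entry error, fixing the data is the better option.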

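I would stay away from cross joins / loop joins, since you are on Redshift. That usually means very large data sets, and those join strategies can blow up on data of that size.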

amazon-redshift