PostgreSQL combine LAG and LEAD to query n previous and following rows
I have a PostgreSQL table, let's call it tokens, containing grammatical annotations for each token in a line of text, basically like this:
idx | line | tno | token | annotation | lemma
----+------+-----+---------+-----------------+---------
1 | I.01 | 1 | This | DEM.PROX | this
2 | I.01 | 2 | is | VB.COP.3SG.PRES | be
3 | I.01 | 3 | an | ART.INDEF | a
4 | I.01 | 4 | example | NN.INAN | example
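To try the queries in this thread, the sample above can be loaded with something like the following. This is only a sketch: the post shows the data but not the table definition, so the column types and the constraints are assumptions.

```sql
-- Column types and constraints are assumptions; the post only shows the data.
CREATE TABLE tokens (
    idx        integer PRIMARY KEY,
    line       text    NOT NULL,   -- line identifier, e.g. 'I.01'
    tno        integer NOT NULL,   -- token number within the line
    token      text    NOT NULL,
    annotation text,
    lemma      text,
    UNIQUE (line, tno)
);

INSERT INTO tokens (idx, line, tno, token, annotation, lemma) VALUES
    (1, 'I.01', 1, 'This',    'DEM.PROX',        'this'),
    (2, 'I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
    (3, 'I.01', 3, 'an',      'ART.INDEF',       'a'),
    (4, 'I.01', 4, 'example', 'NN.INAN',         'example');
```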
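I would like to create a query that lets me search for grammatical context; in this case, a query that checks whether a certain annotation occurs within a window of size n before and after the current row. From what I have read, PostgreSQL's window functions LEAD and LAG are suitable for this. As a first step, I wrote the following query based on the documentation I could find on these functions: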
SELECT *
FROM (
    SELECT token, annotation, lemma,
           -- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
           LEAD(annotation) OVER next_rows AS next_anno
    FROM tokens
    WINDOW next_rows AS (
        ORDER BY line, tno ASC
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".next_anno LIKE '...'
;
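However, this only searches the 2 following rows. My question is: how can I rephrase the query so that the window includes both the preceding and the following rows in the table? Obviously, I cannot have 2 WINDOW clauses or do something like
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING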
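I am not quite sure I understood your use case correctly: you want to check whether a given annotation occurs in one of 5 rows (the 2 preceding rows, the current row, the 2 following rows). Is that correct?
- You can define a window like BETWEEN 2 PRECEDING AND 2 FOLLOWING.
- LEAD or LAG only give you a single value, in this case the value from the row directly after or before the current one (if the window supports it), no matter how many rows your window contains. But you want to check any of these five rows.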
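If you would rather stay with LAG and LEAD, note that both functions accept an optional offset as second argument, so each neighbouring row can be fetched into its own column. The following is a minimal sketch for n = 2, not part of the original answer. The patterns 'be' and 'DEM.%' are placeholders for the elided '...' in the question, and PARTITION BY line is an assumption that keeps the context inside one line of text (the question's window only orders by line and tno).

```sql
SELECT token, annotation, lemma
FROM (
    SELECT token, annotation, lemma,
           LAG(annotation, 2)  OVER w AS prev2,  -- two rows back
           LAG(annotation, 1)  OVER w AS prev1,  -- one row back
           LEAD(annotation, 1) OVER w AS next1,  -- one row ahead
           LEAD(annotation, 2) OVER w AS next2   -- two rows ahead
    FROM tokens
    -- Assumption: context must not cross line boundaries.
    WINDOW w AS (PARTITION BY line ORDER BY tno)
) AS ctx
WHERE lemma = 'be'                 -- placeholder pattern
  AND (   prev2 LIKE 'DEM.%'       -- placeholder pattern
       OR prev1 LIKE 'DEM.%'
       OR next1 LIKE 'DEM.%'
       OR next2 LIKE 'DEM.%');
```

LAG and LEAD return NULL where no such row exists, and NULL LIKE '...' is not true, so the edges of a line need no special handling. The drawback is that the number of columns grows with n, which is why an aggregate over a frame scales better for larger windows.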
One way to achieve this:
demo: db<>fiddle
SELECT *
FROM (
    SELECT token, annotation, lemma,
           unnest(array_agg(annotation) OVER w) AS surrounded_annos -- 2
    FROM tokens
    WINDOW w AS ( -- 1
        ORDER BY line, tno ASC
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".surrounded_annos LIKE '...'
;
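- Define the window as described above (marked -- 1 in the query).
- Aggregate all annotations from these five rows (as far as they exist) with array_agg, which yields an array (marked -- 2). unnest then expands this array into one row per element because, IMHO, you cannot search array elements with LIKE directly. This gives you the following result, which can be filtered in the next step: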
Result of the subquery:
token   | annotation      | lemma   | surrounded_annos
--------+-----------------+---------+-----------------
This    | DEM.PROX        | this    | DEM.PROX
This    | DEM.PROX        | this    | VB.COP.3SG.PRES
This    | DEM.PROX        | this    | ART.INDEF
is      | VB.COP.3SG.PRES | be      | DEM.PROX
is      | VB.COP.3SG.PRES | be      | VB.COP.3SG.PRES
is      | VB.COP.3SG.PRES | be      | ART.INDEF
is      | VB.COP.3SG.PRES | be      | NN.INAN
an      | ART.INDEF       | a       | DEM.PROX
an      | ART.INDEF       | a       | VB.COP.3SG.PRES
an      | ART.INDEF       | a       | ART.INDEF
an      | ART.INDEF       | a       | NN.INAN
example | NN.INAN         | example | VB.COP.3SG.PRES
example | NN.INAN         | example | ART.INDEF
example | NN.INAN         | example | NN.INAN
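Because unnest multiplies the rows, a token shows up once for every surrounding annotation that matches the pattern. If one row per token is preferred, a possible variation (a sketch, not part of the original answer) keeps the array and tests it with a correlated EXISTS over unnest. The alias ctx, the column name annos and the patterns 'be' and 'DEM.%' are placeholders chosen for illustration.

```sql
SELECT token, annotation, lemma, annos
FROM (
    SELECT token, annotation, lemma, line, tno,
           -- all annotations in the 5-row frame, kept as one array
           array_agg(annotation) OVER w AS annos
    FROM tokens
    WINDOW w AS (
        ORDER BY line, tno
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
) AS ctx
WHERE lemma = 'be'                      -- placeholder pattern
  AND EXISTS (                          -- true if any array element matches
        SELECT 1
        FROM unnest(annos) AS a(anno)
        WHERE anno LIKE 'DEM.%'         -- placeholder pattern
      )
ORDER BY line, tno;
```

As in the answer's query, the frame includes the current row, so a token's own annotation also counts as a match. For exact comparisons instead of patterns, 'DEM.PROX' = ANY (annos) would do the same job without the subquery.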
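Another approach is to compute the relative position of each token within the sentence and to self-join tokens <--> tokens (this would also allow you to select skip-grams based on distance):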
WITH www AS ( -- enumerate word positions within sentences
    SELECT line, tno -- candidate key
         , row_number() OVER sentence AS rn
    FROM tokens
    WINDOW sentence AS (ORDER BY line ASC, tno ASC)
)
SELECT t0.line AS line
     , t0.token AS this
     , t1.tno AS tno
     , w1.rn - w0.rn AS rel -- relative position
     , t1.token AS that
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line -- same sentence
JOIN www w0 ON t0.line = w0.line AND t0.tno = w0.tno -- PK1
JOIN www w1 ON t1.line = w1.line AND t1.tno = w1.tno -- PK2
WHERE 1=1
  AND t0.lemma LIKE 'be'
  -- AND t1.annotation LIKE '%.PROX' AND w1.rn - w0.rn = -1
;
-- But if your tno is consecutive (gapless) within lines,
-- you can omit the enumeration step and do a plain self-join:
SELECT t0.line AS line
     , t0.token AS this
     , t1.tno AS tno
     , t1.tno - t0.tno AS rel -- relative position
     , t1.token AS that
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line -- same sentence
WHERE 1=1
  AND t0.lemma LIKE 'be'
  -- AND t1.annotation LIKE '%.PROX' AND t1.tno - t0.tno = -1
;
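To restrict the plain self-join to a window of n rows around the current token, the distance can go straight into the join condition. The sketch below uses n = 2 and assumes, like the query above, that tno is gapless within a line. The patterns 'be' and 'DEM.%' are placeholders.

```sql
SELECT t0.line,
       t0.token        AS this,
       t1.tno - t0.tno AS rel,     -- relative position, -2 .. +2
       t1.token        AS that,
       t1.annotation   AS anno
FROM tokens t0
JOIN tokens t1
  ON  t1.line = t0.line                          -- same sentence
  AND t1.tno BETWEEN t0.tno - 2 AND t0.tno + 2   -- at most 2 tokens away
  AND t1.tno <> t0.tno                           -- skip the token itself
WHERE t0.lemma = 'be'                -- placeholder pattern
  AND t1.annotation LIKE 'DEM.%'     -- placeholder pattern
ORDER BY t0.line, t0.tno, rel;
```

Compared with the window-function variants, this returns one row per matching neighbour together with its distance, which is convenient when the position of the match matters.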