在许多相邻行上使用 WHERE 进行缓慢的 Postgres 查询
Painfully slow Postgres query using WHERE on many adjacent rows
我有以下 psql table。它总共有大约 20 亿行。
id word lemma pos textid source
1 Stuffing stuff vvg 190568 AN
2 her her appge 190568 AN
3 key key nn1 190568 AN
4 into into ii 190568 AN
5 the the at 190568 AN
6 lock lock nn1 190568 AN
7 she she appge 190568 AN
8 pushed push vvd 190568 AN
9 her her appge 190568 AN
10 way way nn1 190568 AN
11 into into ii 190568 AN
12 the the appge 190568 AN
13 house house nn1 190568 AN
14 . . 190568 AN
15 She she appge 190568 AN
16 had have vhd 190568 AN
17 also also rr 190568 AN
18 cajoled cajole vvd 190568 AN
19 her her appge 190568 AN
20 way way nn1 190568 AN
21 into into ii 190568 AN
22 the the at 190568 AN
23 home home nn1 190568 AN
24 . . 190568 AN
.. ... ... .. ... ..
我想创建以下 table,它会并排显示所有 "way" 的结构,以及 "source"、[= 列中的一些数据31=] 和 "pos"。
source word word word lemma pos word word word word word lemma pos word word
AN lock she pushed push vvd her way into the house house nn1 . she
AN had also cajoled cajole vvd her way into the home home nn1 . A
AN tried to force force vvi her way into the palace palace nn1 , officials
在这里你可以看到我使用的代码:
copy(
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM
orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11
WHERE
c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%'
AND
c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id
ORDER BY c1.id
)
TO
'/home/postgres/Results/OUTPUT.csv'
DELIMITER E'\t'
csv header;
为 20 亿行(结果大约有 19,000 行)执行查询需要将近 9 个小时。
我可以做些什么来提高性能?
word、pos 和 lemma 列已经有 btree 索引。
我是否应该坚持我的代码并简单地使用一个更强大的服务器cores/a更快CPU和更多的内存(我的只有 8 GB 内存,只有 2 个内核和 2.8 GHz )?或者您会推荐一个不同的、更有效的 SQL 查询吗?
谢谢!
我建议使用现代连接语法,这可能会很好地解决问题:
SELECT
c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus AS c1
JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id
JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id
JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id
JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id
JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id
JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id
JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id
JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id
JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id
JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id
WHERE c4.pos LIKE 'v%'
AND c5.pos = 'appge'
AND c6.lemma = 'way'
AND c7.pos LIKE 'i%'
AND c8.word = 'the'
AND c9.pos LIKE 'n%'
备注:
- 删除了冗余的
LIKE
ORDER BY
去掉了,因为太贵了。 CSV(如 table 行)不需要排序即可生效。如果您绝对需要排序,请在执行查询后使用命令行工具对其进行排序。
Step1:使用一个window函数获取相邻条记录,避免痛苦的自连接(12tables已经非常接近极限了其中 geqo 接管):
copy(
WITH stuff AS (
SELECT c1.id , c1.source, c1.word
, LEAD ( c1.word, 1) OVER (www) AS c2w
, LEAD (c1.word, 2) OVER (www) AS c3w
, LEAD ( c1.word, 3) OVER (www) AS c4w
, LEAD (c1.lemma, 3) OVER (www) AS c4l
, LEAD (c1.pos, 3) OVER (www) AS c4p
, LEAD (c1.pos, 4) OVER (www) AS c5p
, LEAD (c1.word, 4) OVER (www) AS c5w
, LEAD (c1.word, 5) OVER (www) AS c6w
, LEAD (c1.lemma, 5) OVER (www) AS c6l
, LEAD (c1.word, 6) OVER (www) AS c7w
, LEAD (c1.pos, 6) OVER (www) AS c7p
, LEAD (c1.word, 7) OVER (www) AS c8w
, LEAD (c1.word, 8) OVER (www) AS c9w
, LEAD (c1.lemma, 8) OVER (www) AS c9l
, LEAD (c1.pos, 8) OVER (www) AS c9p
, LEAD (c1.word, 9) OVER (www) AS c10w
, LEAD (c1.word, 10) OVER (www) AS c11w
FROM orderedflatcorpus AS c1
WINDOW www AS (ORDER BY id)
)
SELECT id , source, word
, c2w
, c3w
, c4w
, c4l
, c4p
, c5w
, c6w
, c7w
, c8w
, c9w
, c9l
, c9p
, c10w
, c11w
FROM stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY id
)
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
TO '/tmp/OUTPUT2.csv' DELIMITER E'\t' csv header;
第 2 步:[数据模型] {word,lemma, pos} 列似乎是一个低基数组,您可以将它们压缩到一个单独的 token/lemma/pos-table:
-- An index to speedup the unique extraction and final update
-- (the index will be dropped automatically
-- once the columns are dropped)
CREATE INDEX ON tmp.orderedflatcorpus (word, lemma, pos );
ANALYZE tmp.orderedflatcorpus;
-- table containing the "squeezed out" domain
CREATE TABLE tmp.words AS
SELECT DISTINCT word, lemma, pos
FROM tmp.orderedflatcorpus
;
ALTER TABLE tmp.words
ADD COLUMN id SERIAL NOT NULL PRIMARY KEY;
ALTER TABLE tmp.words
ADD UNIQUE (word , lemma, pos );
-- The original table needs an FK "link" to the new table
ALTER TABLE tmp.orderedflatcorpus
ADD column words_id INTEGER -- NOT NULL
REFERENCES tmp.words(id)
;
-- FK constraints are helped a lot by a supportive index.
CREATE INDEX orderedflatcorpus_words_id_fk ON tmp.orderedflatcorpus (words_id)
;
ANALYZE tmp.orderedflatcorpus;
ANALYZE tmp.words;
-- Initialize the FK column in the original table.
-- we need NOT DISTINCT FROM here, since the joined
-- columns could contain NULLs , which MUST compare equal.
-- ------------------------------------------------------
UPDATE tmp.orderedflatcorpus dst
SET words_id = src.id
FROM tmp.words src
WHERE src.word IS NOT DISTINCT FROM dst.word
AND dst.lemma IS NOT DISTINCT FROM src.lemma
AND dst.pos IS NOT DISTINCT FROM src.pos
;
ALTER TABLE tmp.orderedflatcorpus
DROP column word
, DROP column lemma
, DROP column pos
;
以及新的查询,对单词进行 JOIN -table:
copy(
WITH stuff AS (
SELECT c1.id , c1.source, w.word
, LEAD ( w.word, 1) OVER (www) AS c2w
, LEAD (w.word, 2) OVER (www) AS c3w
, LEAD ( w.word, 3) OVER (www) AS c4w
, LEAD (w.lemma, 3) OVER (www) AS c4l
, LEAD (w.pos, 3) OVER (www) AS c4p
, LEAD (w.pos, 4) OVER (www) AS c5p
, LEAD (w.word, 4) OVER (www) AS c5w
, LEAD (w.word, 5) OVER (www) AS c6w
, LEAD (w.lemma, 5) OVER (www) AS c6l
, LEAD (w.word, 6) OVER (www) AS c7w
, LEAD (w.pos, 6) OVER (www) AS c7p
, LEAD (w.word, 7) OVER (www) AS c8w
, LEAD (w.word, 8) OVER (www) AS c9w
, LEAD (w.lemma, 8) OVER (www) AS c9l
, LEAD (w.pos, 8) OVER (www) AS c9p
, LEAD (w.word, 9) OVER (www) AS c10w
, LEAD (w.word, 10) OVER (www) AS c11w
FROM orderedflatcorpus AS c1
JOIN words w ON w.id=c1.words_id
WINDOW www AS (ORDER BY c1.id)
)
SELECT id , source, word
, c2w , c3w
, c4w , c4l , c4p
, c5w
, c6w
, c7w
, c8w
, c9w , c9l , c9p
, c10w
, c11w
FROM stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY id
)
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
TO '/tmp/OUTPUT3.csv' DELIMITER E'\t' csv header;
注意:我在输出中得到两行,因为我把条件放宽了一点......
Update :第一次查询,避免CTE
copy(
SELECT id , source, word
, c2w
, c3w
, c4w
, c4l
, c4p
, c5w
, c6w
, c7w
, c8w
, c9w
, c9l
, c9p
, c10w
, c11w
FROM (
SELECT c1.id , c1.source, c1.word
, LEAD ( c1.word, 1) OVER (www) AS c2w
, LEAD (c1.word, 2) OVER (www) AS c3w
, LEAD ( c1.word, 3) OVER (www) AS c4w
, LEAD (c1.lemma, 3) OVER (www) AS c4l
, LEAD (c1.pos, 3) OVER (www) AS c4p
, LEAD (c1.pos, 4) OVER (www) AS c5p
, LEAD (c1.word, 4) OVER (www) AS c5w
, LEAD (c1.word, 5) OVER (www) AS c6w
, LEAD (c1.lemma, 5) OVER (www) AS c6l
, LEAD (c1.word, 6) OVER (www) AS c7w
, LEAD (c1.pos, 6) OVER (www) AS c7p
, LEAD (c1.word, 7) OVER (www) AS c8w
, LEAD (c1.word, 8) OVER (www) AS c9w
, LEAD (c1.lemma, 8) OVER (www) AS c9l
, LEAD (c1.pos, 8) OVER (www) AS c9p
, LEAD (c1.word, 9) OVER (www) AS c10w
, LEAD (c1.word, 10) OVER (www) AS c11w
FROM orderedflatcorpus AS c1
WINDOW www AS (ORDER BY id)
) stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY id
)
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
TO '/tmp/OUTPUT2a.csv' DELIMITER E'\t' csv header;
[可以对第二个查询执行类似的转换]
UPDATE2 两个 table 变体的子查询版本。
-- copy(
-- EXPLAIN ANALYZE
SELECT c1i, c1s, c1w
, c2w , c3w
, c4w , c4l , c4p
, c5w
, c6w
, c7w
, c8w
, c9w , c9l , c9p
, c10w
, c11w
FROM (
SELECT c1.id AS c1i
, c1.source AS c1s
, w1.word AS c1w
, LEAD (w1.word, 1) OVER www AS c2w
, LEAD (w1.word, 2) OVER www AS c3w
, LEAD (w1.word, 3) OVER www AS c4w
, LEAD (w1.lemma, 3) OVER www AS c4l
, LEAD (w1.pos, 3) OVER www AS c4p
, LEAD (w1.pos, 4) OVER www AS c5p
, LEAD (w1.word, 4) OVER www AS c5w
, LEAD (w1.word, 5) OVER www AS c6w
, LEAD (w1.lemma, 5) OVER www AS c6l
, LEAD (w1.word, 6) OVER www AS c7w
, LEAD (w1.pos, 6) OVER www AS c7p
, LEAD (w1.word, 7) OVER www AS c8w
, LEAD (w1.word, 8) OVER www AS c9w
, LEAD (w1.lemma, 8) OVER www AS c9l
, LEAD (w1.pos, 8) OVER www AS c9p
, LEAD (w1.word, 9) OVER www AS c10w
, LEAD (w1.word, 10) OVER www AS c11w
FROM orderedflatcorpus c1
JOIN words w1 ON w1.id=c1.words_id
WHERE 1=1
/* These *could* to prune out unmatched items, but I could not get it to work ...
AND EXISTS (SELECT *FROM orderedflatcorpus c4 JOIN words w4 ON w4.id=c4.words_id
WHERE c4.id = 3+c1.id -- AND w4.pos LIKE 'v%'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c5 JOIN words w5 ON w5.id=c5.words_id
WHERE c5.id = 4+c1.id -- AND w5.pos = 'appge'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c7 JOIN words w7 ON w7.id=c7.words_id
WHERE c7.id = 6+c1.id -- AND w7.pos LIKE 'i%'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c9 JOIN words w9 ON w9.id=c9.words_id
WHERE c9.id = 8+c1.id -- AND w9.pos LIKE 'n%'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c8 JOIN words w8 ON w8.id=c8.words_id
WHERE c8.id = 7+c1.id -- AND w8.word = 'the'
) -- OMG
*/
WINDOW www AS (ORDER BY c1.id ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING)
) stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY c1i
;
-- )
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
-- TO '/tmp/OUTPUT3b.csv' DELIMITER E'\t' csv header;
让我们尝试重新格式化您的查询,看看我们能看到什么。要做的第一件事是将其更改为使用 ANSI 样式的连接,以便我们可以清楚地看到关系是什么:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1 AND
c2.WORD LIKE '%'
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2 AND
c3.WORD LIKE '%'
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9 AND
c10.WORD LIKE '%'
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10 AND
c11.WORD LIKE '%'
WHERE c1.WORD LIKE '%'
ORDER BY c1.id
好的,首先 - 所有 LIKE 都将终止此查询。让我们尽可能地消除它们。我在这里假设在 ORDEREDFLATCORPUS 中 WORD 不能为 NULL,因此可以消除所有 IS LIKE '%'
条件:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10
ORDER BY c1.id
如果 WORD 可以为 NULL,那么您可能需要使用:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1 AND
c2.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2 AND
c3.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9 AND
c10.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10 AND
c11.WORD IS NOT NULL
WHERE c1.WORD IS NOT NULL
ORDER BY c1.id
接下来 - 此查询需要尽可能早地进行尽可能多的过滤。数据库查询优化器 可能 能够解决这个问题,但让我们先将等值连接放在连接列表中,然后调整 ID 计算以反映我们的信息,从而给它一些帮助重新获得第一名:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM DUAL
INNER JOIN orderedflatcorpus c5
ON c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c5.ID + 1 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c5.ID + 3 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c1
ON c1.ID = c5.ID - 4 AND
INNER JOIN orderedflatcorpus c2
ON c2.ID = c5.ID - 3
INNER JOIN orderedflatcorpus c3
ON c3.ID = c5.ID - 2
INNER JOIN orderedflatcorpus c4
ON c4.ID = c5.ID - 1 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c5.ID + 2 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c5.ID + 4 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c5.ID + 5
INNER JOIN orderedflatcorpus c11
ON c11.ID = c5.ID + 6
ORDER BY c1.id
接下来我们需要考虑哪些索引最有用。我认为以下索引值得拥有:
(ID)
(ID, WORD)
(ID, LEMMA)
(ID, POS)
将这些索引放在 运行 这个查询上,看看是否有帮助。另外,检查 ID 计算 - 我 认为 我做对了但是... :-)
祝你好运。
我有以下 psql table。它总共有大约 20 亿行。
id word lemma pos textid source
1 Stuffing stuff vvg 190568 AN
2 her her appge 190568 AN
3 key key nn1 190568 AN
4 into into ii 190568 AN
5 the the at 190568 AN
6 lock lock nn1 190568 AN
7 she she appge 190568 AN
8 pushed push vvd 190568 AN
9 her her appge 190568 AN
10 way way nn1 190568 AN
11 into into ii 190568 AN
12 the the appge 190568 AN
13 house house nn1 190568 AN
14 . . 190568 AN
15 She she appge 190568 AN
16 had have vhd 190568 AN
17 also also rr 190568 AN
18 cajoled cajole vvd 190568 AN
19 her her appge 190568 AN
20 way way nn1 190568 AN
21 into into ii 190568 AN
22 the the at 190568 AN
23 home home nn1 190568 AN
24 . . 190568 AN
.. ... ... .. ... ..
我想创建以下 table,它会并排显示所有 "way" 的结构,以及 "source"、[= 列中的一些数据31=] 和 "pos"。
source word word word lemma pos word word word word word lemma pos word word
AN lock she pushed push vvd her way into the house house nn1 . she
AN had also cajoled cajole vvd her way into the home home nn1 . A
AN tried to force force vvi her way into the palace palace nn1 , officials
在这里你可以看到我使用的代码:
copy(
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM
orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11
WHERE
c1.word LIKE '%' AND
c2.word LIKE '%' AND
c3.word LIKE '%' AND
c4.pos LIKE 'v%' AND
c5.pos = 'appge' AND
c6.lemma = 'way' AND
c7.pos LIKE 'i%' AND
c8.word = 'the' AND
c9.pos LIKE 'n%' AND
c10.word LIKE '%' AND
c11.word LIKE '%'
AND
c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id
ORDER BY c1.id
)
TO
'/home/postgres/Results/OUTPUT.csv'
DELIMITER E'\t'
csv header;
为 20 亿行(结果大约有 19,000 行)执行查询需要将近 9 个小时。
我可以做些什么来提高性能?
word、pos 和 lemma 列已经有 btree 索引。
我是否应该坚持我的代码并简单地使用一个更强大的服务器cores/a更快CPU和更多的内存(我的只有 8 GB 内存,只有 2 个内核和 2.8 GHz )?或者您会推荐一个不同的、更有效的 SQL 查询吗?
谢谢!
我建议使用现代连接语法,这可能会很好地解决问题:
SELECT
c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus AS c1
JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id
JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id
JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id
JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id
JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id
JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id
JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id
JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id
JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id
JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id
WHERE c4.pos LIKE 'v%'
AND c5.pos = 'appge'
AND c6.lemma = 'way'
AND c7.pos LIKE 'i%'
AND c8.word = 'the'
AND c9.pos LIKE 'n%'
备注:
- 删除了冗余的
LIKE
ORDER BY
去掉了,因为太贵了。 CSV(如 table 行)不需要排序即可生效。如果您绝对需要排序,请在执行查询后使用命令行工具对其进行排序。
Step1:使用一个window函数获取相邻条记录,避免痛苦的自连接(12tables已经非常接近极限了其中 geqo 接管):
copy(
WITH stuff AS (
SELECT c1.id , c1.source, c1.word
, LEAD ( c1.word, 1) OVER (www) AS c2w
, LEAD (c1.word, 2) OVER (www) AS c3w
, LEAD ( c1.word, 3) OVER (www) AS c4w
, LEAD (c1.lemma, 3) OVER (www) AS c4l
, LEAD (c1.pos, 3) OVER (www) AS c4p
, LEAD (c1.pos, 4) OVER (www) AS c5p
, LEAD (c1.word, 4) OVER (www) AS c5w
, LEAD (c1.word, 5) OVER (www) AS c6w
, LEAD (c1.lemma, 5) OVER (www) AS c6l
, LEAD (c1.word, 6) OVER (www) AS c7w
, LEAD (c1.pos, 6) OVER (www) AS c7p
, LEAD (c1.word, 7) OVER (www) AS c8w
, LEAD (c1.word, 8) OVER (www) AS c9w
, LEAD (c1.lemma, 8) OVER (www) AS c9l
, LEAD (c1.pos, 8) OVER (www) AS c9p
, LEAD (c1.word, 9) OVER (www) AS c10w
, LEAD (c1.word, 10) OVER (www) AS c11w
FROM orderedflatcorpus AS c1
WINDOW www AS (ORDER BY id)
)
SELECT id , source, word
, c2w
, c3w
, c4w
, c4l
, c4p
, c5w
, c6w
, c7w
, c8w
, c9w
, c9l
, c9p
, c10w
, c11w
FROM stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY id
)
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
TO '/tmp/OUTPUT2.csv' DELIMITER E'\t' csv header;
第 2 步:[数据模型] {word,lemma, pos} 列似乎是一个低基数组,您可以将它们压缩到一个单独的 token/lemma/pos-table:
-- An index to speedup the unique extraction and final update
-- (the index will be dropped automatically
-- once the columns are dropped)
CREATE INDEX ON tmp.orderedflatcorpus (word, lemma, pos );
ANALYZE tmp.orderedflatcorpus;
-- table containing the "squeezed out" domain
CREATE TABLE tmp.words AS
SELECT DISTINCT word, lemma, pos
FROM tmp.orderedflatcorpus
;
ALTER TABLE tmp.words
ADD COLUMN id SERIAL NOT NULL PRIMARY KEY;
ALTER TABLE tmp.words
ADD UNIQUE (word , lemma, pos );
-- The original table needs an FK "link" to the new table
ALTER TABLE tmp.orderedflatcorpus
ADD column words_id INTEGER -- NOT NULL
REFERENCES tmp.words(id)
;
-- FK constraints are helped a lot by a supportive index.
CREATE INDEX orderedflatcorpus_words_id_fk ON tmp.orderedflatcorpus (words_id)
;
ANALYZE tmp.orderedflatcorpus;
ANALYZE tmp.words;
-- Initialize the FK column in the original table.
-- we need NOT DISTINCT FROM here, since the joined
-- columns could contain NULLs , which MUST compare equal.
-- ------------------------------------------------------
UPDATE tmp.orderedflatcorpus dst
SET words_id = src.id
FROM tmp.words src
WHERE src.word IS NOT DISTINCT FROM dst.word
AND dst.lemma IS NOT DISTINCT FROM src.lemma
AND dst.pos IS NOT DISTINCT FROM src.pos
;
ALTER TABLE tmp.orderedflatcorpus
DROP column word
, DROP column lemma
, DROP column pos
;
以及新的查询,对单词进行 JOIN -table:
copy(
WITH stuff AS (
SELECT c1.id , c1.source, w.word
, LEAD ( w.word, 1) OVER (www) AS c2w
, LEAD (w.word, 2) OVER (www) AS c3w
, LEAD ( w.word, 3) OVER (www) AS c4w
, LEAD (w.lemma, 3) OVER (www) AS c4l
, LEAD (w.pos, 3) OVER (www) AS c4p
, LEAD (w.pos, 4) OVER (www) AS c5p
, LEAD (w.word, 4) OVER (www) AS c5w
, LEAD (w.word, 5) OVER (www) AS c6w
, LEAD (w.lemma, 5) OVER (www) AS c6l
, LEAD (w.word, 6) OVER (www) AS c7w
, LEAD (w.pos, 6) OVER (www) AS c7p
, LEAD (w.word, 7) OVER (www) AS c8w
, LEAD (w.word, 8) OVER (www) AS c9w
, LEAD (w.lemma, 8) OVER (www) AS c9l
, LEAD (w.pos, 8) OVER (www) AS c9p
, LEAD (w.word, 9) OVER (www) AS c10w
, LEAD (w.word, 10) OVER (www) AS c11w
FROM orderedflatcorpus AS c1
JOIN words w ON w.id=c1.words_id
WINDOW www AS (ORDER BY c1.id)
)
SELECT id , source, word
, c2w , c3w
, c4w , c4l , c4p
, c5w
, c6w
, c7w
, c8w
, c9w , c9l , c9p
, c10w
, c11w
FROM stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY id
)
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
TO '/tmp/OUTPUT3.csv' DELIMITER E'\t' csv header;
注意:我在输出中得到两行,因为我把条件放宽了一点......
Update :第一次查询,避免CTE
copy(
SELECT id , source, word
, c2w
, c3w
, c4w
, c4l
, c4p
, c5w
, c6w
, c7w
, c8w
, c9w
, c9l
, c9p
, c10w
, c11w
FROM (
SELECT c1.id , c1.source, c1.word
, LEAD ( c1.word, 1) OVER (www) AS c2w
, LEAD (c1.word, 2) OVER (www) AS c3w
, LEAD ( c1.word, 3) OVER (www) AS c4w
, LEAD (c1.lemma, 3) OVER (www) AS c4l
, LEAD (c1.pos, 3) OVER (www) AS c4p
, LEAD (c1.pos, 4) OVER (www) AS c5p
, LEAD (c1.word, 4) OVER (www) AS c5w
, LEAD (c1.word, 5) OVER (www) AS c6w
, LEAD (c1.lemma, 5) OVER (www) AS c6l
, LEAD (c1.word, 6) OVER (www) AS c7w
, LEAD (c1.pos, 6) OVER (www) AS c7p
, LEAD (c1.word, 7) OVER (www) AS c8w
, LEAD (c1.word, 8) OVER (www) AS c9w
, LEAD (c1.lemma, 8) OVER (www) AS c9l
, LEAD (c1.pos, 8) OVER (www) AS c9p
, LEAD (c1.word, 9) OVER (www) AS c10w
, LEAD (c1.word, 10) OVER (www) AS c11w
FROM orderedflatcorpus AS c1
WINDOW www AS (ORDER BY id)
) stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY id
)
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
TO '/tmp/OUTPUT2a.csv' DELIMITER E'\t' csv header;
[可以对第二个查询执行类似的转换]
UPDATE2 两个 table 变体的子查询版本。
-- copy(
-- EXPLAIN ANALYZE
SELECT c1i, c1s, c1w
, c2w , c3w
, c4w , c4l , c4p
, c5w
, c6w
, c7w
, c8w
, c9w , c9l , c9p
, c10w
, c11w
FROM (
SELECT c1.id AS c1i
, c1.source AS c1s
, w1.word AS c1w
, LEAD (w1.word, 1) OVER www AS c2w
, LEAD (w1.word, 2) OVER www AS c3w
, LEAD (w1.word, 3) OVER www AS c4w
, LEAD (w1.lemma, 3) OVER www AS c4l
, LEAD (w1.pos, 3) OVER www AS c4p
, LEAD (w1.pos, 4) OVER www AS c5p
, LEAD (w1.word, 4) OVER www AS c5w
, LEAD (w1.word, 5) OVER www AS c6w
, LEAD (w1.lemma, 5) OVER www AS c6l
, LEAD (w1.word, 6) OVER www AS c7w
, LEAD (w1.pos, 6) OVER www AS c7p
, LEAD (w1.word, 7) OVER www AS c8w
, LEAD (w1.word, 8) OVER www AS c9w
, LEAD (w1.lemma, 8) OVER www AS c9l
, LEAD (w1.pos, 8) OVER www AS c9p
, LEAD (w1.word, 9) OVER www AS c10w
, LEAD (w1.word, 10) OVER www AS c11w
FROM orderedflatcorpus c1
JOIN words w1 ON w1.id=c1.words_id
WHERE 1=1
/* These *could* to prune out unmatched items, but I could not get it to work ...
AND EXISTS (SELECT *FROM orderedflatcorpus c4 JOIN words w4 ON w4.id=c4.words_id
WHERE c4.id = 3+c1.id -- AND w4.pos LIKE 'v%'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c5 JOIN words w5 ON w5.id=c5.words_id
WHERE c5.id = 4+c1.id -- AND w5.pos = 'appge'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c7 JOIN words w7 ON w7.id=c7.words_id
WHERE c7.id = 6+c1.id -- AND w7.pos LIKE 'i%'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c9 JOIN words w9 ON w9.id=c9.words_id
WHERE c9.id = 8+c1.id -- AND w9.pos LIKE 'n%'
) -- OMG
AND EXISTS (SELECT *FROM orderedflatcorpus c8 JOIN words w8 ON w8.id=c8.words_id
WHERE c8.id = 7+c1.id -- AND w8.word = 'the'
) -- OMG
*/
WINDOW www AS (ORDER BY c1.id ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING)
) stuff
WHERE 1=1
AND c4p LIKE 'v%'
AND c5p = 'appge'
AND c6l = 'way'
AND c7p LIKE 'i%'
AND c8w = 'the'
AND c9p LIKE 'n%'
ORDER BY c1i
;
-- )
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header;
-- TO '/tmp/OUTPUT3b.csv' DELIMITER E'\t' csv header;
让我们尝试重新格式化您的查询,看看我们能看到什么。要做的第一件事是将其更改为使用 ANSI 样式的连接,以便我们可以清楚地看到关系是什么:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1 AND
c2.WORD LIKE '%'
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2 AND
c3.WORD LIKE '%'
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9 AND
c10.WORD LIKE '%'
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10 AND
c11.WORD LIKE '%'
WHERE c1.WORD LIKE '%'
ORDER BY c1.id
好的,首先 - 所有 LIKE 都将终止此查询。让我们尽可能地消除它们。我在这里假设在 ORDEREDFLATCORPUS 中 WORD 不能为 NULL,因此可以消除所有 IS LIKE '%'
条件:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10
ORDER BY c1.id
如果 WORD 可以为 NULL,那么您可能需要使用:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM orderedflatcorpus c1
INNER JOIN orderedflatcorpus c2
ON c2.ID = c1.ID + 1 AND
c2.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c3
ON c3.ID = c1.ID + 2 AND
c3.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c4
ON c4.ID = c1.ID + 3 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c5
ON c5.ID = c1.ID + 4 AND
c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c1.ID + 5 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c1.ID + 6 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c1.ID + 7 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c1.ID + 8 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c1.ID + 9 AND
c10.WORD IS NOT NULL
INNER JOIN orderedflatcorpus c11
ON c11.ID = c1.ID + 10 AND
c11.WORD IS NOT NULL
WHERE c1.WORD IS NOT NULL
ORDER BY c1.id
接下来 - 此查询需要尽可能早地进行尽可能多的过滤。数据库查询优化器 可能 能够解决这个问题,但让我们先将等值连接放在连接列表中,然后调整 ID 计算以反映我们的信息,从而给它一些帮助重新获得第一名:
SELECT c1.source, c1.word, c2.word, c3.word, c4.word,
c4.lemma, c4.pos, c5.word, c6.word, c7.word,
c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word
FROM DUAL
INNER JOIN orderedflatcorpus c5
ON c5.pos = 'appge'
INNER JOIN orderedflatcorpus c6
ON c6.ID = c5.ID + 1 AND
c6.lemma = 'way'
INNER JOIN orderedflatcorpus c8
ON c8.ID = c5.ID + 3 AND
c8.word = 'the'
INNER JOIN orderedflatcorpus c1
ON c1.ID = c5.ID - 4 AND
INNER JOIN orderedflatcorpus c2
ON c2.ID = c5.ID - 3
INNER JOIN orderedflatcorpus c3
ON c3.ID = c5.ID - 2
INNER JOIN orderedflatcorpus c4
ON c4.ID = c5.ID - 1 AND
c4.pos LIKE 'v%'
INNER JOIN orderedflatcorpus c7
ON c7.ID = c5.ID + 2 AND
c7.pos LIKE 'i%'
INNER JOIN orderedflatcorpus c9
ON c9.ID = c5.ID + 4 AND
c9.pos LIKE 'n%'
INNER JOIN orderedflatcorpus c10
ON c10.ID = c5.ID + 5
INNER JOIN orderedflatcorpus c11
ON c11.ID = c5.ID + 6
ORDER BY c1.id
接下来我们需要考虑哪些索引最有用。我认为以下索引值得拥有:
(ID)
(ID, WORD)
(ID, LEMMA)
(ID, POS)
将这些索引放在 运行 这个查询上,看看是否有帮助。另外,检查 ID 计算 - 我 认为 我做对了但是... :-)
祝你好运。