Sort the letters in a string alphabetically in PostgreSQL
I am currently using this method to sort the letters in a string alphabetically in PostgreSQL. Are there any other efficient ways?
select string_agg(c, '') as s
from (select unnest(regexp_split_to_array('ijsAafhareDbv', '')) as c
order by c) as t;
s
--------------
ADaabefhijrsv
If you want a solution without a regular expression, you can use this:
WITH t(s) AS (VALUES ('amfjwzeils'))
SELECT string_agg(substr(t.s, g.g, 1), ''
ORDER BY substr(t.s, g.g, 1)
)
FROM t
CROSS JOIN LATERAL generate_series(1, length(t.s)) g;
string_agg
------------
aefijlmswz
(1 row)
I would benchmark to see which solution is faster.
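I created 3 functions: one using my query, another using Laurenz's query, and a third, a Python (plpythonu) function for sorting. Finally, I created a table with 100000 rows (done on my Mac laptop for now), each containing a random 15-character string generated with the random_string function from this Link.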
create table t as select random_string(15) as s FROM generate_series(1,100000);
Here are the 3 functions.
CREATE or REPLACE FUNCTION sort1(x TEXT) RETURNS TEXT AS $$
select string_agg(c, '') as s
from (select unnest(regexp_split_to_array($1, '')) as c
order by c) as t;
$$ LANGUAGE SQL IMMUTABLE;
CREATE or REPLACE FUNCTION sort2(x TEXT) RETURNS TEXT AS $$
WITH t(s) AS (VALUES ($1))
SELECT string_agg(substr(t.s, g.g, 1), ''
ORDER BY substr(t.s, g.g, 1)
)
FROM t
CROSS JOIN LATERAL generate_series(1, length(t.s)) g;
$$ LANGUAGE SQL IMMUTABLE;
create language plpythonu;
CREATE or REPLACE FUNCTION pysort(x text)
RETURNS text
AS $$
return ''.join(sorted(x))
$$ LANGUAGE plpythonu IMMUTABLE;
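The body of pysort is plain Python, so it can be tried outside the database first. The sketch below wraps the same expression in a helper named sort_letters (a hypothetical name, not part of the original post). Note that sorted() compares characters by Unicode code point, so uppercase ASCII letters sort before lowercase ones.

```python
def sort_letters(s: str) -> str:
    """Return the characters of s sorted by Unicode code point.

    Same expression as the body of pysort; the function name is
    hypothetical and only used for this illustration.
    """
    return ''.join(sorted(s))


if __name__ == "__main__":
    # Sample strings taken from the question and the first answer.
    print(sort_letters('ijsAafhareDbv'))
    print(sort_letters('amfjwzeils'))
```

For the two sample strings this yields the same results as the SQL queries shown earlier.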
These are the EXPLAIN ANALYSE results for all three.
knayak=# EXPLAIN ANALYSE select sort1(s) FROM t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..26541.00 rows=100000 width=32) (actual time=0.266..7097.740 rows=100000 loops=1)
Planning time: 0.119 ms
Execution time: 7106.871 ms
(3 rows)
knayak=# EXPLAIN ANALYSE select sort2(s) FROM t;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..26541.00 rows=100000 width=32) (actual time=0.418..7012.935 rows=100000 loops=1)
Planning time: 0.270 ms
Execution time: 7021.587 ms
(3 rows)
knayak=# EXPLAIN ANALYSE select pysort(s) FROM t;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..26541.00 rows=100000 width=32) (actual time=0.060..389.729 rows=100000 loops=1)
Planning time: 0.048 ms
Execution time: 395.760 ms
(3 rows)
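Based on this analysis, the Python sort turned out to be the fastest, and there is no significant difference between the first two. It still needs to be checked against the huge tables in our real system, though.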
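Functions implemented in C are much faster than anything we can implement with LANGUAGE sql or plpgsql. So your plpythonu function, which relies on Python's built-in sorted() (implemented in C in the standard CPython interpreter), wins the performance contest by a wide margin.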
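But plpythonu is an untrusted procedural language. It is not installed by default, and only superusers can create functions in untrusted languages. You need to be aware of the security implications. Untrusted languages are not available at all on most cloud services.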
The current manual (quote from pg 10):
PL/Python is only available as an “untrusted” language, meaning it does not offer any way of restricting what users can do in it and is therefore named plpythonu. A trusted variant plpython might become available in the future if a secure execution mechanism is developed in Python. The writer of a function in untrusted PL/Python must take care that the function cannot be used to do anything unwanted, since it will be able to do anything that could be done by a user logged in as the database administrator. Only superusers can create functions in untrusted languages such as plpythonu.
The SQL functions you tested are not well optimized. There are a thousand and one ways to improve performance. Still:
Demo
-- func to create random strings
CREATE OR REPLACE FUNCTION f_random_string(int)
RETURNS text AS
$func$
SELECT array_to_string(ARRAY(
SELECT substr('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', (ceil(random()*62))::int, 1)
FROM generate_series(1, $1)
), '')
$func$ LANGUAGE sql VOLATILE;
-- test tbl with 100K rows
CREATE TABLE tbl(str text);
INSERT INTO tbl
SELECT f_random_string(15)
FROM generate_series(1, 100000) g;
VACUUM ANALYZE tbl;
-- 1: your test function 1 (inefficient)
CREATE OR REPLACE FUNCTION sort1(text) RETURNS text AS
$func$ -- your test function 1 (very inefficient)
SELECT string_agg(c, '')
FROM (SELECT unnest(regexp_split_to_array($1, '')) AS c ORDER BY c) t;
$func$ LANGUAGE sql IMMUTABLE;
-- 2: your test function 2 (inefficient)
CREATE OR REPLACE FUNCTION sort2(text) RETURNS text AS
$func$
WITH t(s) AS (VALUES ($1))
SELECT string_agg(substr(t.s, g.g, 1), '' ORDER BY substr(t.s, g.g, 1))
FROM t
CROSS JOIN LATERAL generate_series(1, length(t.s)) g;
$func$ LANGUAGE sql IMMUTABLE;
-- 3: remove pointless CTE from sort2
CREATE OR REPLACE FUNCTION sort3(text) RETURNS text AS
$func$
SELECT string_agg(substr($1, g, 1), '' ORDER BY substr($1, g, 1))
FROM generate_series(1, length($1)) g;
$func$ LANGUAGE sql IMMUTABLE;
-- 4: use unnest instead of calling substr N times
CREATE OR REPLACE FUNCTION sort4(text) RETURNS text AS
$func$
SELECT string_agg(c, '' ORDER BY c)
FROM unnest(string_to_array($1, NULL)) c
$func$ LANGUAGE sql IMMUTABLE;
-- 5: ORDER BY in subquery
CREATE OR REPLACE FUNCTION sort5(text) RETURNS text AS
$func$
SELECT string_agg(c, '')
FROM (
SELECT c
FROM unnest(string_to_array($1, NULL)) c
ORDER BY c
) sub
$func$ LANGUAGE sql IMMUTABLE;
-- 6: SRF in SELECT list
CREATE OR REPLACE FUNCTION sort6(text) RETURNS text AS
$func$
SELECT string_agg(c, '')
FROM (SELECT unnest(string_to_array($1, NULL)) c ORDER BY 1) sub
$func$ LANGUAGE sql IMMUTABLE;
-- 7: ARRAY constructor instead of aggregate func
CREATE OR REPLACE FUNCTION sort7(text) RETURNS text AS
$func$
SELECT array_to_string(ARRAY(SELECT unnest(string_to_array($1, NULL)) c ORDER BY c), '')
$func$ LANGUAGE sql IMMUTABLE;
-- 8: The same with COLLATE "C"
CREATE OR REPLACE FUNCTION sort8(text) RETURNS text AS
$func$
SELECT array_to_string(ARRAY(SELECT unnest(string_to_array($1 COLLATE "C", NULL)) c ORDER BY c), '')
$func$ LANGUAGE sql IMMUTABLE;
SELECT str, sort1(str), sort2(str), sort3(str), sort4(str), sort5(str), sort6(str), sort7(str), sort8(str) FROM tbl LIMIT 1; -- result sample
str | sort1 | sort2 | sort3 | sort4 | sort5 | sort6 | sort7 | sort8
:-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :--------------
tUkmori4D1rHhI1 | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DHIUhikmorrt
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort1(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.053 ms |
| Execution time: 2742.904 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort2(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.105 ms |
| Execution time: 2579.397 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort3(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.079 ms |
| Execution time: 2191.228 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort4(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.075 ms |
| Execution time: 2194.780 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort5(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.083 ms |
| Execution time: 1902.829 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort6(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.075 ms |
| Execution time: 1866.407 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort7(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.067 ms |
| Execution time: 1863.713 ms |
EXPLAIN (ANALYZE, TIMING OFF) SELECT sort8(str) FROM tbl;
| QUERY PLAN |
| :--------------------------------------------------------------------------------------- |
| Seq Scan on tbl (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
| Planning time: 0.074 ms |
| Execution time: 1569.376 ms |
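The eight execution times are easier to compare side by side. The following sketch simply copies the numbers reported in the plans above and computes each variant's speedup relative to sort1. The helper name speedups is made up for this illustration, and the figures come from a single run each, so small differences (for example sort3 vs. sort4, or sort6 vs. sort7) are probably within noise.

```python
# Execution times in ms, copied from the EXPLAIN (ANALYZE, TIMING OFF) output above.
timings_ms = {
    "sort1": 2742.904,
    "sort2": 2579.397,
    "sort3": 2191.228,
    "sort4": 2194.780,
    "sort5": 1902.829,
    "sort6": 1866.407,
    "sort7": 1863.713,
    "sort8": 1569.376,
}


def speedups(timings, baseline="sort1"):
    """Return each variant's speedup factor relative to the baseline."""
    base = timings[baseline]
    return {name: base / ms for name, ms in timings.items()}


if __name__ == "__main__":
    for name, factor in speedups(timings_ms).items():
        print(f"{name}: {timings_ms[name]:9.3f} ms  {factor:.2f}x vs sort1")
```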
db<>fiddle here
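The last one sorts without COLLATION rules, strictly by the byte values of the characters, which is much cheaper. But you may or may not need the sort order of a different locale.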
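The difference between the two orderings can be reproduced outside the database. The sketch below is only an approximation. Python's sorted() compares code points, which for ASCII input matches a byte-value ordering such as COLLATE "C". The key function locale_like_key is a hypothetical stand-in that happens to reproduce the locale-aware result from the sample row; it is not how PostgreSQL or the operating system actually implements collation.

```python
def sort_bytewise(s: str) -> str:
    """Sort characters by code point (byte order for ASCII input)."""
    return ''.join(sorted(s))


def locale_like_key(c: str):
    # Hypothetical approximation of a locale-aware collation:
    # compare case-insensitively first, then put lowercase before uppercase.
    return (c.casefold(), c.isupper())


def sort_locale_like(s: str) -> str:
    """Sort characters with the approximate locale-like key above."""
    return ''.join(sorted(s, key=locale_like_key))


if __name__ == "__main__":
    sample = 'tUkmori4D1rHhI1'  # sample row from the demo
    print(sort_bytewise(sample))     # same ordering as sort8 in the sample row
    print(sort_locale_like(sample))  # same ordering as sort1..sort7 in the sample row
```

The plain code-point sort needs no key function at all, which mirrors why the COLLATE "C" variant is the cheapest of the SQL functions.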