Check the uniqueness of a table field in Postgres

Question

In some cases (a large dataset stored in a table), I need to check the uniqueness of a field in a Postgres table.

For simplicity, suppose I have the following table:

id   |    name
--------------
1    |  david
2    |  catrine
3    |  hmida

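I want to check the uniqueness of the name field; for this table the result would be true. So far, I have managed with code similar to the following: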

select name, count(*)
from test
group by name
having count(*) > 1

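Keep in mind that I have a large dataset, so I would prefer to have the RDBMS handle this rather than fetching the data through an adapter (psycopg2, for example). So, again, I need this to be as optimized as possible. Any nerdy ideas?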

Answer 1

Here is a solution that may be faster, but is less reliable. First, some sample data:

t=# create table t (i int);
CREATE TABLE
t=# insert into t select generate_series(1,9,1);
INSERT 0 9
t=# insert into t select generate_series(1,999999,1);
INSERT 0 999999
t=# insert into t select generate_series(1,9999999,1);
INSERT 0 9999999

Now your query:

t=# select i,count(*) from t group by i having count(*) > 1 order by 2 desc,1 limit 1;
 i | count
---+-------
 1 |     3
(1 row)

Time: 7538.476 ms
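
Since the question asks for a true/false result rather than a list of duplicates, the exact check can be collapsed into a single boolean. The following is a minimal sketch, assuming the table t and column i from the setup above; the alias is_unique is my own name. In the general case both forms still read the whole table, so treat them as the reliable baseline rather than a speedup.

```sql
-- Exact check, form 1: true when no value of i occurs more than once.
-- EXISTS only needs one duplicate group to decide.
SELECT NOT EXISTS (
    SELECT 1
    FROM t
    GROUP BY i
    HAVING count(*) > 1
) AS is_unique;

-- Exact check, form 2: compare the non-NULL row count with the distinct count.
-- count(DISTINCT i) ignores NULLs, so count(i) is used on the left to match.
SELECT count(i) = count(DISTINCT i) AS is_unique
FROM t;
```

With the sample data above both return false, because every value below 1000000 was inserted more than once.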

Checking the statistics instead:

t=# analyze t;
ANALYZE
Time: 1079.465 ms
t=# with fr as (select most_common_vals::text::text[] from pg_stats where tablename = 't' and attname='i')
select count(1),i from t join fr on true where i::text = any(most_common_vals) group by i;
 count |   i
-------+--------
     2 |  94933
     2 | 196651
     2 | 242894
     2 | 313829
     2 | 501027
     2 | 757714
     2 | 778442
     2 | 896602
     2 | 929918
     2 | 979650
     2 | 999259
(11 rows)

Time: 3584.582 ms

And finally, to check only whether a non-unique value exists, it is enough to test just one of the most common values:

t=# select count(1),i from t where i::text = (select (most_common_vals::text::text[])[1] from pg_stats where tablename = 't' and attname='i') group by i;
 count |  i
-------+------
     2 | 1540
(1 row)

Time: 1871.907 ms
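
The same statistics view also carries an estimate of the number of distinct values, which can serve as an even cheaper hint. This is only a sketch, assuming ANALYZE has been run recently on t and that t lives in the public schema. In pg_stats, a negative n_distinct is minus the ratio of distinct values to rows, so -1 means the planner believes every value is distinct. It is an estimate taken from a sample, so it can report -1 for a column that has a few duplicates; treat it as a hint, not a proof.

```sql
-- Planner's estimate of distinct values for t.i (reads only the catalog).
--   n_distinct = -1 : estimated as all-distinct (unique-looking)
--   n_distinct <  0 : minus (distinct values / row count)
--   n_distinct >  0 : estimated absolute number of distinct values
SELECT n_distinct,
       n_distinct = -1 AS looks_unique   -- alias is my own name
FROM pg_stats
WHERE schemaname = 'public'   -- assumption: t is in the public schema
  AND tablename  = 't'
  AND attname    = 'i';
```

If looks_unique comes back false, the column almost certainly has duplicates; if it comes back true, an exact check is still needed to be sure.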

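Update

The pg_stats data is only refreshed when statistics are gathered on the table. So there is a chance that you do not have up-to-date aggregated statistics on the data distribution. For example, in my sample: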

t=# delete from t where i = 1540;
DELETE 2
Time: 941.684 ms
t=# select count(1),i from t where i::text = (select (most_common_vals::text::text[])[1] from pg_stats where tablename = 't' and attname='i') group by i;
 count | i
-------+---
(0 rows)

Time: 1876.136 ms
t=# analyze t;
ANALYZE
Time: 77.108 ms
t=# select count(1),i from t where i::text = (select (most_common_vals::text::text[])[1] from pg_stats where tablename = 't' and attname='i') group by i;
 count |   i
-------+-------
     2 | 41377
(1 row)

Time: 1878.260 ms

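Of course, if you rely on more than one most common value, the chance of failure decreases, but again, this approach depends on the "freshness" of the statistics.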
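
One way to judge that freshness before trusting pg_stats is to look at when the table was last analyzed. A sketch, assuming the statistics collector is enabled: pg_stat_user_tables exposes last_analyze and last_autoanalyze, both NULL if the table has never been analyzed. The stats_age alias is my own name. This does not tell you how many rows changed since then, so it only narrows the risk.

```sql
-- When were statistics for t last gathered, manually or by autovacuum?
-- greatest() skips NULL arguments, so stats_age is NULL only if the
-- table has never been analyzed at all.
SELECT relname,
       last_analyze,
       last_autoanalyze,
       now() - greatest(last_analyze, last_autoanalyze) AS stats_age
FROM pg_stat_user_tables
WHERE relname = 't';
```

If stats_age is larger than you are comfortable with, run analyze t first and then query pg_stats.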

Tags: sql, postgresql, postgresql-9.1