Need help identifying dups in the table
I have:
- data_source_1 table
- data_source_2 table
- data_sources_view view
About the tables:
data_source_1:
No dups:
db=# select count(*) from (select distinct * from data_source_1);
count
--------
543243
(1 row)
db=# select count(*) from (select * from data_source_1);
count
--------
543243
(1 row)
data_source_2:
No dups:
db=# select count(*) from (select * from data_source_2);
count
-------
5304
(1 row)
db=# select count(*) from (select distinct * from data_source_2);
count
-------
5304
(1 row)
data_sources_view:
Has dups:
db=# select count(*) from (select distinct * from data_sources_view);
count
--------
538714
(1 row)
db=# select count(*) from (select * from data_sources_view);
count
--------
548547
(1 row)
The view is simple:
CREATE VIEW data_sources_view
AS SELECT *
FROM (
(
SELECT a, b, 'data_source_1' as source
FROM data_source_1
)
UNION ALL
(
SELECT a, b, 'data_source_2' as source
FROM data_source_2
)
);
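What I want to know:
- How is it possible to have dups in a view when the source tables have no dups, and the 'data_source_x' as source column removes the possibility of overlapping data?
- How do I identify the dups?
What I've tried:
db=# create table t1 as select * from data_sources_view;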
SELECT
db=#
db=# create table t2 as select distinct * from data_sources_view;
SELECT
db=# create table t3 as select * from t1 minus select * from t2;
SELECT
db=# select 't1' as table_name, count(*) from t1 UNION ALL
db-# select 't2' as table_name, count(*) from t2 UNION ALL
db-# select 't3' as table_name, count(*) from t3;
table_name | count
------------+--------
t1 | 548547
t3 | 0
t2 | 538714
(3 rows)
Database:
Redshift (PostgreSQL)
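The reason is that your data sources have more than two columns, while the view selects only a and b. Rows that differ only in the other columns become identical once they are reduced to (a, b, source). If you do these counts:
select count(*) from (select distinct a, b from data_source_1);
and
select count(*) from (select distinct a, b from data_source_2);
you should find that they differ from the count(*) you get on the tables.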
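To make this concrete, here is a minimal sketch that uses Python's built-in sqlite3 module in place of Redshift, purely so it can run on its own. The extra column c and the sample rows are hypothetical, since the question does not show the real columns of the source tables. It shows a table with no fully duplicated rows producing duplicates once only a and b are selected, and a GROUP BY ... HAVING query that lists the duplicated combinations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source tables with a third column c that the view does not select.
cur.execute("CREATE TABLE data_source_1 (a INTEGER, b INTEGER, c TEXT)")
cur.execute("CREATE TABLE data_source_2 (a INTEGER, b INTEGER, c TEXT)")
cur.executemany(
    "INSERT INTO data_source_1 VALUES (?, ?, ?)",
    [(1, 10, "x"), (1, 10, "y"), (2, 20, "x")],  # first two rows differ only in c
)
cur.executemany(
    "INSERT INTO data_source_2 VALUES (?, ?, ?)",
    [(1, 10, "x"), (3, 30, "z")],
)

# Same shape as the view in the question.
cur.execute("""
    CREATE VIEW data_sources_view AS
    SELECT a, b, 'data_source_1' AS source FROM data_source_1
    UNION ALL
    SELECT a, b, 'data_source_2' AS source FROM data_source_2
""")


def count(sql):
    """Run a query that returns a single count and return it as an int."""
    return cur.execute(sql).fetchone()[0]


full_rows = count("SELECT count(*) FROM (SELECT DISTINCT * FROM data_source_1)")
projected = count("SELECT count(*) FROM (SELECT DISTINCT a, b FROM data_source_1)")
view_all = count("SELECT count(*) FROM data_sources_view")
view_distinct = count("SELECT count(*) FROM (SELECT DISTINCT * FROM data_sources_view)")

# List each (a, b, source) combination that occurs more than once in the view.
dups = cur.execute("""
    SELECT a, b, source, count(*) AS n
    FROM data_sources_view
    GROUP BY a, b, source
    HAVING count(*) > 1
""").fetchall()

print(full_rows, projected)     # distinct full rows vs. distinct (a, b) pairs
print(view_all, view_distinct)  # view row count vs. distinct view row count
print(dups)                     # the duplicated combinations and how often each occurs
```

The GROUP BY ... HAVING query is standard SQL, so the same statement should also work against the real view. It also explains why t3 came out empty: MINUS is a set operation that returns distinct rows, and every row of t1 also appears in t2, so nothing is left.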
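UNION vs UNION ALL
- UNION - duplicate rows are removed: if the data exists in the top query, it is suppressed in the bottom query.
Output:
FOO
- UNION ALL - the data is duplicated because it exists in both tables (both records are shown).
Output:
FOO
FOO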
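The difference can be reproduced with a small sketch, again using Python's built-in sqlite3 module as a stand-in for Redshift. The table names top_t and bottom_t and the column val are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical one-column tables that both contain the value 'FOO'.
cur.execute("CREATE TABLE top_t (val TEXT)")
cur.execute("CREATE TABLE bottom_t (val TEXT)")
cur.execute("INSERT INTO top_t VALUES ('FOO')")
cur.execute("INSERT INTO bottom_t VALUES ('FOO')")

# UNION removes duplicate rows from the combined result.
union_rows = cur.execute(
    "SELECT val FROM top_t UNION SELECT val FROM bottom_t"
).fetchall()

# UNION ALL keeps every row from both queries.
union_all_rows = cur.execute(
    "SELECT val FROM top_t UNION ALL SELECT val FROM bottom_t"
).fetchall()

print(union_rows)
print(union_all_rows)
```

Note that the view in the question adds a different source literal in each branch, so rows from the two tables can never be identical to each other; any duplicates must come from within a single branch.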