How to find items with multiple tags on them?
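Question:
I have a table named item_tag_assn that maps items to tags (a many-to-many association table). I need to find the items that have a given set of tags applied. For example, if my table has the following data: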
item_id | tag_id
------------------
205 | 110
206 | 120
207 | 130
205 | 130
206 | 147
210 | 110
205 | 152
209 | 111
210 | 177
205 | 147
212 | 110
212 | 135
205 | 135
212 | 147
------------------
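If I search for
- items with tags 110, 135 and 147, I expect items #205 and #212 in the result set.
- items with tags 110, 130, 135, 147 and 152, I should get only item #205, because only #205 has all of these tags associated with it.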
Environment:
- PostgreSQL 9.5
- I am not allowed to add a third column to this table or to create a new table at all.
Progress so far:
I found a solution like this:
SELECT DISTINCT ita1.item_id
FROM
item_tag_assn AS ita1
LEFT JOIN
item_tag_assn AS ita2 ON ita1.item_id = ita2.item_id
LEFT JOIN
item_tag_assn AS ita3 ON ita2.item_id = ita3.item_id
GROUP BY ita1.item_id
HAVING
sum((ita1.tag_id = 110 and ita2.tag_id = 135 and ita3.tag_id = 147)::integer) >= 1
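And it works.
Optimization needed
The association table is fairly large. Joining it with itself is expensive and slow, and it does not scale well either. I think window functions could help, but I don't know how to use them.
Is there a better way to solve this problem?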
Answer:
If I understand correctly, you need something like this:
WITH search AS (
SELECT '{110,130,135,147,152}'::int4[] as search
), searched AS (
SELECT DISTINCT item_id,
tag_id
FROM item_tag_assn
JOIN search ON (tag_id) = ANY(search)
ORDER BY 1, 2
), aggregated AS (
SELECT item_id,
array_agg(tag_id) AS agg
FROM searched
GROUP BY 1
)
SELECT *
FROM aggregated, search
WHERE agg = search
;
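search - sets the search array (the array must be pre-sorted).
searched - keeps only the rows whose tag_id appears in the search array; all other rows are filtered out.
aggregated - aggregates the tag_id values into an array per item_id.
You can change agg = search to agg @> search; after that you no longer need the pre-sorting or the ORDER BY in searched.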
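The @> variant mentioned above could look roughly like the following. This is only a sketch, assuming tag_id is of type integer (so that array_agg yields int4[]); the aliases ita, s and a are introduced here for readability and are not part of the original query.

```sql
-- Sketch of the containment (@>) variant; assumes tag_id is integer.
WITH search AS (
    -- The search array no longer has to be sorted.
    SELECT '{110,135,147}'::int4[] AS search
), aggregated AS (
    -- Keep only rows with a searched tag, then collect the tags per item.
    SELECT ita.item_id,
           array_agg(ita.tag_id) AS agg
    FROM item_tag_assn AS ita
    JOIN search AS s ON ita.tag_id = ANY (s.search)
    GROUP BY ita.item_id
)
SELECT a.item_id
FROM aggregated AS a, search AS s
-- True when agg contains every element of the search array.
WHERE a.agg @> s.search;
-- On the sample data this returns item_id 205 and 212 (row order is not guaranteed).
```

Because @> ignores both element order and duplicates, this form needs neither DISTINCT nor ORDER BY, and it does not depend on the order in which rows reach array_agg.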
With the dataset from the question added:
WITH item_tag_assn AS (
SELECT 205 as item_id, 110 as tag_id
UNION SELECT 206 , 120
UNION SELECT 207 , 130
UNION SELECT 205 , 130
UNION SELECT 206 , 147
UNION SELECT 210 , 110
UNION SELECT 205 , 152
UNION SELECT 209 , 111
UNION SELECT 210 , 177
UNION SELECT 205 , 147
UNION SELECT 212 , 110
UNION SELECT 212 , 135
UNION SELECT 205 , 135
UNION SELECT 212 , 147
),search AS (
SELECT '{110,130,135,147,152}'::int4[] as search
), searched AS (
SELECT DISTINCT item_id,
tag_id
FROM item_tag_assn
JOIN search ON (tag_id) = ANY(search)
ORDER BY 1, 2
), aggregated AS (
SELECT item_id,
array_agg(tag_id) AS agg
FROM searched
GROUP BY 1
)
SELECT *
FROM aggregated, search
WHERE agg = search
;
Result:
item_id | agg | search
---------+-----------------------+-----------------------
205 | {110,130,135,147,152} | {110,130,135,147,152}
(1 row)
If you change the search to '{110,135,147}':
item_id | agg | search
---------+---------------+---------------
212 | {110,135,147} | {110,135,147}
205 | {110,135,147} | {110,135,147}
(2 rows)
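As a cross-check of the two result sets above, the same search can also be phrased without arrays, as a count over the matching rows. Note that this is a swapped-in formulation (commonly called relational division), not the query from the answer; the literal 3 is simply the number of distinct tags in the list and has to be adjusted by hand when the list changes.

```sql
-- Count-based cross-check: an item qualifies when it matches
-- every one of the three searched tags.
SELECT item_id
FROM item_tag_assn
WHERE tag_id IN (110, 135, 147)
GROUP BY item_id
-- DISTINCT guards against duplicate (item_id, tag_id) rows.
HAVING count(DISTINCT tag_id) = 3;
-- On the sample data this returns item_id 205 and 212 (row order is not guaranteed).
```

For the five-tag search, use IN (110, 130, 135, 147, 152) with = 5, which on the sample data leaves only item 205.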
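For production use you need to create an index: CREATE INDEX ON item_tag_assn (tag_id);
With the index in place the plan looks like this (in this test run the table apparently is named x and the index i1):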
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=32.72..35.34 rows=1 width=68) (actual time=0.055..0.059 rows=3 loops=1)
Hash Cond: (aggregated.agg = search.search)
CTE search
-> Result (cost=0.00..0.01 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=1)
CTE searched
-> Unique (cost=27.73..28.55 rows=110 width=8) (actual time=0.029..0.031 rows=3 loops=1)
-> Sort (cost=27.73..28.00 rows=110 width=8) (actual time=0.029..0.029 rows=3 loops=1)
Sort Key: x.item_id, x.tag_id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=10.40..24.00 rows=110 width=8) (actual time=0.013..0.014 rows=3 loops=1)
-> CTE Scan on search search_1 (cost=0.00..0.02 rows=1 width=32) (actual time=0.002..0.002 rows=1 loops=1)
-> Bitmap Heap Scan on x (cost=10.40..22.88 rows=110 width=8) (actual time=0.009..0.009 rows=3 loops=1)
Recheck Cond: (tag_id = ANY (search_1.search))
Heap Blocks: exact=1
-> Bitmap Index Scan on i1 (cost=0.00..10.38 rows=110 width=0) (actual time=0.002..0.002 rows=3 loops=1)
Index Cond: (tag_id = ANY (search_1.search))
CTE aggregated
-> HashAggregate (cost=2.75..4.12 rows=110 width=36) (actual time=0.038..0.039 rows=3 loops=1)
Group Key: searched.item_id
-> CTE Scan on searched (cost=0.00..2.20 rows=110 width=8) (actual time=0.029..0.031 rows=3 loops=1)
-> CTE Scan on aggregated (cost=0.00..2.20 rows=110 width=36) (actual time=0.040..0.043 rows=3 loops=1)
-> Hash (cost=0.02..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> CTE Scan on search (cost=0.00..0.02 rows=1 width=32) (actual time=0.000..0.001 rows=1 loops=1)
Planning time: 0.309 ms
Execution time: 0.115 ms
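To see whether an index is actually picked up on your own data, you could run something along these lines. It is a sketch under two assumptions: that you are permitted to create indexes (the question only rules out new columns and new tables), and that tag_id is of type integer. The index name item_tag_assn_tag_item_idx is made up for this example, and the composite (tag_id, item_id) form is a suggestion rather than the index from the answer; it may allow index-only scans, depending on how recently the table was vacuumed.

```sql
-- Hypothetical composite index; the name is made up for this sketch.
CREATE INDEX item_tag_assn_tag_item_idx
    ON item_tag_assn (tag_id, item_id);

-- Refresh planner statistics after creating the index.
ANALYZE item_tag_assn;

-- Show the actual plan, timings and buffer usage for a three-tag search.
EXPLAIN (ANALYZE, BUFFERS)
SELECT item_id
FROM item_tag_assn
WHERE tag_id = ANY ('{110,135,147}'::int4[])
GROUP BY item_id
HAVING array_agg(tag_id) @> '{110,135,147}'::int4[];
```

Timings on a 14-row sample say little about a large table, so it is worth repeating the EXPLAIN on production-sized data before settling on an index.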