MySQL: Grouping similar rows based on entries in a second table
I really didn't know what to title this.
I have a few tables structured like this.
A "sentences" table:
id | sentence | ...
----------------------------
1 | See Spot run | ...
2 | See Jane run | ...
3 | Jane likes cheese | ...
A "words" table:
id | word (unique)
----------
1 | See
2 | Spot
3 | run
4 | Jane
5 | likes
6 | cheese
And a "word_references" table:
sentence_id | word_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 3
2 | 4
3 | 4
3 | 5
3 | 6
I want to return a list of pairs of similar sentences based on the words they share, ordered by similarity. So it should return:
one | two | similarity
----------------------
1 | 2 | 2
2 | 3 | 1
because sentences 1 and 2 share two words, "See" and "run", while sentences 2 and 3 share one word, "Jane".
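This query should solve your problem. It self-joins word_references on word_id, so each joined row is one word shared by two different sentences; the < comparison keeps each pair only once.
-- Count the words shared by every pair of distinct sentences
SELECT r1.sentence_id AS one,
       r2.sentence_id AS two,
       COUNT(*) AS similarity
FROM word_references r1
INNER JOIN word_references r2
        ON r1.sentence_id < r2.sentence_id
       AND r1.word_id = r2.word_id
GROUP BY r1.sentence_id,
         r2.sentence_id
-- The question asks for the most similar pairs first
ORDER BY similarity DESC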
This gives:
one | two | similarity
----------------------
1 | 2 | 2
2 | 3 | 1
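To try the query locally, the sample data from the question can be loaded with something like the script below. This is only a sketch: the column types, the composite primary key, the foreign keys, and the index name idx_word are assumptions, since the question shows the rows but not the table definitions.

```sql
-- Hypothetical schema: types, keys and index names are assumptions,
-- only the table names, column names and rows come from the question.
CREATE TABLE sentences (
    id       INT PRIMARY KEY,
    sentence VARCHAR(255) NOT NULL
);

CREATE TABLE words (
    id   INT PRIMARY KEY,
    word VARCHAR(64) NOT NULL UNIQUE
);

CREATE TABLE word_references (
    sentence_id INT NOT NULL,
    word_id     INT NOT NULL,
    -- Assumes a word is recorded at most once per sentence
    PRIMARY KEY (sentence_id, word_id),
    -- The self-join matches rows on word_id, so index that column too
    INDEX idx_word (word_id),
    FOREIGN KEY (sentence_id) REFERENCES sentences (id),
    FOREIGN KEY (word_id) REFERENCES words (id)
);

INSERT INTO sentences (id, sentence) VALUES
    (1, 'See Spot run'),
    (2, 'See Jane run'),
    (3, 'Jane likes cheese');

INSERT INTO words (id, word) VALUES
    (1, 'See'), (2, 'Spot'), (3, 'run'),
    (4, 'Jane'), (5, 'likes'), (6, 'cheese');

INSERT INTO word_references (sentence_id, word_id) VALUES
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 3), (2, 4),
    (3, 4), (3, 5), (3, 6);
```

If a sentence could list the same word more than once, the composite primary key would have to go, and COUNT(*) would then count repeated matches rather than distinct shared words.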
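If you change the expression r1.sentence_id < r2.sentence_id to r1.sentence_id <> r2.sentence_id, you get both sides of each relationship (the order of rows with equal similarity is not guaranteed):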
one | two | similarity
----------------------
1 | 2 | 2
2 | 3 | 1
2 | 1 | 2
3 | 2 | 1
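If the sentence text is more useful than the ids, the same self-join can be extended with the sentences and words tables. The following is a sketch rather than part of the original answer; the aliases sentence_one, sentence_two and shared_words are made up for illustration, and GROUP_CONCAT is MySQL-specific.

```sql
-- Same pairing logic as above, plus the sentence text and the shared words
SELECT s1.sentence AS sentence_one,
       s2.sentence AS sentence_two,
       COUNT(*)    AS similarity,
       GROUP_CONCAT(w.word ORDER BY w.id SEPARATOR ', ') AS shared_words
FROM word_references r1
INNER JOIN word_references r2
        ON r1.word_id = r2.word_id
       AND r1.sentence_id < r2.sentence_id
INNER JOIN sentences s1 ON s1.id = r1.sentence_id
INNER JOIN sentences s2 ON s2.id = r2.sentence_id
INNER JOIN words w      ON w.id  = r1.word_id
-- The text columns are listed here so the query also runs
-- with ONLY_FULL_GROUP_BY enabled
GROUP BY r1.sentence_id, r2.sentence_id, s1.sentence, s2.sentence
ORDER BY similarity DESC;
```

With the sample data this should pair "See Spot run" with "See Jane run" (shared words: See, run) and "See Jane run" with "Jane likes cheese" (shared word: Jane).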
Something like this will work:
select w1.sentence_id, w2.sentence_id, count(*) as similarity
from word_references w1
left join word_references w2 on w1.word_id=w2.word_id and w1.sentence_id<>w2.sentence_id
where w2.sentence_id is not null
group by w1.sentence_id, w2.sentence_id
order by count(*) desc
Sample output:
+ ---------------- + ---------------- + --------------- +
| sentence_id | sentence_id | similarity |
+ ---------------- + ---------------- + --------------- +
| 1 | 2 | 2 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |
| 2 | 3 | 1 |
+ ---------------- + ---------------- + --------------- +
4 rows
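Note that a LEFT JOIN followed by a WHERE ... IS NOT NULL filter on the joined table discards exactly the unmatched rows that the LEFT JOIN added, so it behaves like an INNER JOIN. A possible equivalent, sketched here with the aliases one and two (which are not in the query above) so the two id columns get distinct names:

```sql
-- INNER JOIN form of the query above, returning both directions of each pair
SELECT w1.sentence_id AS one,
       w2.sentence_id AS two,
       COUNT(*)       AS similarity
FROM word_references w1
INNER JOIN word_references w2
        ON w1.word_id = w2.word_id
       AND w1.sentence_id <> w2.sentence_id
GROUP BY w1.sentence_id, w2.sentence_id
-- The extra sort keys make the order of tied rows deterministic
ORDER BY similarity DESC, one, two;
```

The tie-breaking sort keys are optional; without them, rows with the same similarity can come back in any order, as in the sample output above.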