MySQL Grouping similar rows based on entries in a second table

I really wasn't sure what to title this.

I have a few tables structured like this:

A "sentences" table

id |    sentence       | ...
----------------------------
1  | See Spot run      | ...
2  | See Jane run      | ...
3  | Jane likes cheese | ...

A "words" table

id | word (unique)
----------
1  | See
2  | Spot
3  | run
4  | Jane
5  | likes
6  | cheese

And a "word_references" table

sentence_id | word_id
---------------------
          1 | 1 
          1 | 2
          1 | 3
          2 | 1
          2 | 3
          2 | 4
          3 | 4
          3 | 5
          3 | 6

I want to return a list of similar sentence pairs based on the words they share, ordered by similarity. So it should return:

one | two | similarity
----------------------
 1  |  2  |  2
 2  |  3  |  1

This is because sentences 1 and 2 share two words, "See" and "run", while sentences 2 and 3 share one word, "Jane".

This query should solve your problem:

SELECT r1.sentence_id AS one,
       r2.sentence_id AS two,
       Count(*)       AS similarity
FROM   word_references r1
       INNER JOIN word_references r2
               ON r1.sentence_id < r2.sentence_id
                  AND r1.word_id = r2.word_id
GROUP  BY r1.sentence_id,
          r2.sentence_id
ORDER  BY similarity DESC

This gives:

one | two | similarity
----------------------
 1  |  2  |  2
 2  |  3  |  1
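
To check the self-join against the sample data without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for MySQL as an assumption of this sketch; the join itself uses only standard SQL. The query is sorted by similarity, as the question asks, and the variable name `pairs` is purely illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_references (sentence_id INTEGER, word_id INTEGER)")
conn.executemany(
    "INSERT INTO word_references VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (2, 4), (3, 4), (3, 5), (3, 6)],
)

# Pair every reference with references to the same word in a later sentence,
# then count the matches per sentence pair.
pairs = conn.execute(
    """
    SELECT r1.sentence_id AS one,
           r2.sentence_id AS two,
           COUNT(*)       AS similarity
    FROM   word_references r1
           INNER JOIN word_references r2
                   ON r1.sentence_id < r2.sentence_id
                  AND r1.word_id = r2.word_id
    GROUP  BY r1.sentence_id, r2.sentence_id
    ORDER  BY similarity DESC, one, two
    """
).fetchall()

for one, two, similarity in pairs:
    print(one, two, similarity)
```

Sentences 1 and 3 have no word in common, so the inner join produces no row for that pair at all, rather than a row with similarity 0.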

If you change the expression r1.sentence_id < r2.sentence_id to r1.sentence_id <> r2.sentence_id, you get both sides of each relationship:

one | two | similarity
----------------------
 1  |  2  |  2
 2  |  3  |  1
 2  |  1  |  2
 3  |  2  |  1
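
The effect of swapping the comparison operator can be sketched as follows, again assuming Python's sqlite3 module as a stand-in for MySQL. The helper `similar_pairs` is hypothetical and exists only to run the same query with either operator. With `<` each pair appears once; with `<>` each pair appears twice, once in each direction.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_references (sentence_id INTEGER, word_id INTEGER)")
conn.executemany(
    "INSERT INTO word_references VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (2, 4), (3, 4), (3, 5), (3, 6)],
)


def similar_pairs(comparison):
    """Hypothetical helper: run the self-join with '<' or '<>' between sentence ids."""
    if comparison not in ("<", "<>"):
        raise ValueError("comparison must be '<' or '<>'")
    query = f"""
        SELECT r1.sentence_id AS one,
               r2.sentence_id AS two,
               COUNT(*)       AS similarity
        FROM   word_references r1
               INNER JOIN word_references r2
                       ON r1.sentence_id {comparison} r2.sentence_id
                      AND r1.word_id = r2.word_id
        GROUP  BY r1.sentence_id, r2.sentence_id
        ORDER  BY similarity DESC, one, two
    """
    return conn.execute(query).fetchall()


one_way = similar_pairs("<")     # each unordered pair once
both_ways = similar_pairs("<>")  # each pair in both directions

print(len(one_way), len(both_ways))
```

Use `<` when a pair only needs to be reported once, and `<>` when you want to look up the neighbours of any given sentence in the first column.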

Something like this would work:

select w1.sentence_id, w2.sentence_id, count(*) as similarity
from word_references w1
inner join word_references w2 on w1.word_id = w2.word_id and w1.sentence_id <> w2.sentence_id
group by w1.sentence_id, w2.sentence_id
order by count(*) desc

Sample output:

+ ---------------- + ---------------- + --------------- +
| sentence_id      | sentence_id      | similarity      |
+ ---------------- + ---------------- + --------------- +
| 1                | 2                | 2               |
| 2                | 1                | 2               |
| 3                | 2                | 1               |
| 2                | 3                | 1               |
+ ---------------- + ---------------- + --------------- +
4 rows