Distinct terms not being identified correctly in MySQL

Question

I created a database to store a simple inverted index built from Bengali text documents.

Table name: simple_index, primary key {Term, Document_id}

Table definition:

CREATE TABLE IF NOT EXISTS basicindex.simple_index (
    term varchar(255) NOT NULL, 
    doc_id INT NOT NULL,
    frequency INT NOT NULL,
    PRIMARY KEY (term,doc_id) 
)

Strangely, I found the following two distinct terms:

খুঁজে - present in documents 3, 16, 34
খুজে - present in document 1

When I execute the following queries:

Query 1:

select doc_id from basicindex.simple_index where term='খুঁজে';

Query 2:

select doc_id from basicindex.simple_index where term = 'খুজে';

Both queries return 4 rows, claiming that খুঁজে and খুজে are each present in all four documents.

From the logs, I found that the [distinct term, document id, frequency] entry for খুজে was inserted only for document id 1:

Inserting index for খুজে -> {DocID: 1, frequency: 1}

('খুজে', 1, 1)

The term খুঁজে was inserted for document ids 3, 16 and 34:

Inserting index for খুঁজে -> {DocID: 3, frequency: 1}

('খুঁজে', 3, 1)

Inserting index for খুঁজে -> {DocID: 16, frequency: 2}

('খুঁজে', 16, 2)

Inserting index for খুঁজে -> {DocID: 34, frequency: 1}

('খুঁজে', 34, 1)

Below are the Unicode code points of the two terms:

খুঁজে [('খ', 2454), ('ু',2497), ('ঁ',2433), ('জ',2460), ('ে',2503)]

খুজে [('খ',2454), ('ু',2497), ('জ',2460), ('ে',2503)]
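The listing above can be reproduced with a short Python sketch. It assumes Python 3 and the standard unicodedata module; the helper name describe is made up for illustration. It suggests that the only difference between the two terms is the combining sign U+0981 (decimal 2433):

```python
import unicodedata


def describe(term):
    """Return (character, code point, Unicode name) for each character in term."""
    return [(ch, ord(ch), unicodedata.name(ch)) for ch in term]


# The two terms, written as escapes so the exact code points are unambiguous.
with_mark = "\u0996\u09c1\u0981\u099c\u09c7"   # খুঁজে
without_mark = "\u0996\u09c1\u099c\u09c7"      # খুজে

for term in (with_mark, without_mark):
    for ch, code_point, name in describe(term):
        print(f"U+{code_point:04X} ({code_point}) {name}")
    print()

# Characters that occur in the first term but not in the second.
extra = set(with_mark) - set(without_mark)
print([unicodedata.name(ch) for ch in extra])
```

Any comparison that treats this sign as insignificant will therefore see the two terms as the same string.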

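I am using MySQL version 8.0.13. I would appreciate help understanding why the database behaves this way. Why does it fail to distinguish between "খুঁজে" and "খুজে"? What can I do to correct this?

I have attached documents 1, 3, 16 and 34, along with the input and output log files, here for your reference.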

Answer 1

both return 4 rows claiming that খুঁজে and খুজে are present in all the four documents.

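Check the COLLATION in use and explicitly specify the COLLATE you need. In MySQL 8.0 the default collation for utf8mb4 is utf8mb4_0900_ai_ci, which is accent-insensitive, so if the table uses that default the two terms compare as equal. A binary collation such as utf8mb4_bin compares them code point by code point.

For example: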

CREATE TABLE IF NOT EXISTS simple_index (
    term varchar(255) NOT NULL, 
    doc_id INT NOT NULL,
    frequency INT NOT NULL,
    PRIMARY KEY (term,doc_id) 
);

INSERT INTO simple_index VALUES
('খুঁজে', 1, 0 ),
('খুজে', 2, 0 );
SELECT * FROM simple_index;
| term | doc_id | frequency |
| :--- | -----: | --------: |
| খুঁজে | 1 | 0 |
| খুজে | 2 | 0 |

select doc_id from simple_index where term = 'খুঁজে';
select doc_id from simple_index where term = 'খুজে';

| doc_id |
| -----: |
|      1 |
|      2 |

| doc_id |
| -----: |
|      1 |
|      2 |

select doc_id from simple_index where term = 'খুঁজে' COLLATE utf8mb4_bin;
select doc_id from simple_index where term = 'খুজে' COLLATE utf8mb4_bin;

| doc_id |
| -----: |
|      1 |

| doc_id |
| -----: |
|      2 |

db<>fiddle here
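Since the question is tagged python, here is a minimal sketch of how the same COLLATE clause might be issued from Python. It assumes a DB-API cursor whose driver uses the %s parameter style (for example mysql-connector-python or mysqlclient) and a connection character set of utf8mb4. The function names exact_term_query and find_documents are hypothetical, not part of any library:

```python
BINARY_COLLATION = "utf8mb4_bin"  # compares strings code point by code point


def exact_term_query(collation=BINARY_COLLATION):
    """Build a parameterized query that matches term under the given collation."""
    return (
        "SELECT doc_id FROM basicindex.simple_index "
        f"WHERE term = %s COLLATE {collation}"
    )


def find_documents(cursor, term):
    """Run the exact-match query on a DB-API cursor and return the doc ids."""
    cursor.execute(exact_term_query(), (term,))
    return [row[0] for row in cursor.fetchall()]


# Possible one-off alternative (an assumption, not part of the answer above):
# change the column's collation so every comparison, including the primary
# key's uniqueness check, distinguishes the two terms.
ALTER_COLUMN_SQL = (
    "ALTER TABLE basicindex.simple_index "
    "MODIFY term VARCHAR(255) CHARACTER SET utf8mb4 "
    "COLLATE utf8mb4_bin NOT NULL"
)
```

Changing the column collation is probably the more durable option for an inverted index: with an accent-insensitive collation on the column, the primary key (term, doc_id) would likely reject the two terms as duplicates if they ever occurred in the same document.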
