How to make exact Unicode characters take priority over ASCII versions?

I have a database containing the names of German towns and cities, such as Munich and Münster.

If I query it like this:

SELECT name,
       MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance
FROM place_names
ORDER BY relevance DESC

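I get the same relevance value for text containing mun, mün, or anything else that flattens to mun when diacritics are ignored. In other words, searching for mun or mün gives exactly the same results.

How can I configure my database so that searching for mün gives higher relevance to words that actually contain the letter ü, while still treating u as a match?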

CREATE TABLE place_names (id SERIAL PRIMARY KEY, name VARCHAR(255));
CREATE FULLTEXT INDEX idx ON place_names (name);
INSERT INTO place_names (name) VALUES ('Munich'), ('Münster');
SELECT * FROM place_names;
id  name
1   Munich
2   Münster
SELECT name,
       MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance
FROM place_names
ORDER BY relevance DESC;
name     relevance
Munich   0.000000001885928302414186
Münster  0.000000001885928302414186
ALTER TABLE place_names ADD COLUMN name2 VARCHAR(255) COLLATE utf8mb4_0900_bin AS (name) STORED;
CREATE FULLTEXT INDEX idx2 ON place_names (name2);
SELECT name,
       MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance,
       MATCH(name2) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance2
FROM place_names
ORDER BY relevance DESC;
name     relevance                   relevance2
Munich   0.000000001885928302414186  0
Münster  0.000000001885928302414186  0.0906190574169159

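So, to rank exact matches first, keep the match on name but order by the relevance computed on the binary-collated column name2: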

SELECT name,
       MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance
FROM place_names
ORDER BY MATCH(name2) AGAINST('+mün*' IN BOOLEAN MODE) DESC;

One approach might be:

SELECT name, MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance
FROM place_names
WHERE MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE)
ORDER BY name LIKE '%Mün%' COLLATE utf8mb4_bin DESC, relevance DESC;

Another thing to note is that MySQL 8.0 has the collation utf8mb4_0900_as_ci ("accent-sensitive, case-insensitive"). (However, on its own that does not match "Mun" at all.)
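
The two ideas above could be combined. This is a minimal sketch, not a tested configuration. It assumes MySQL 8.0, a utf8mb4 connection character set, and the place_names table with the FULLTEXT index on name from the question. The full-text match stays accent-insensitive, so Munich is still returned, while the first sort key uses utf8mb4_0900_as_ci instead of utf8mb4_bin, so rows containing an actual ü should sort first regardless of letter case.

```sql
-- Sketch only: assumes MySQL 8.0 and that the string literals are utf8mb4
-- (the default connection character set in 8.0).
SELECT name,
       MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE) AS relevance
FROM place_names
-- Accent-insensitive full-text filter: keeps both "Munich" and "Münster".
WHERE MATCH(name) AGAINST('+mün*' IN BOOLEAN MODE)
-- Accent-sensitive, case-insensitive test: evaluates to 1 only when the
-- name contains "mün" with a real ü, so those rows are listed first.
ORDER BY name LIKE '%mün%' COLLATE utf8mb4_0900_as_ci DESC,
         relevance DESC;
```

Unlike the utf8mb4_bin variant, this does not require the search term's capitalization to match the stored name. The LIKE test with a leading wildcard cannot use an index, but it runs only on rows already selected by the full-text filter.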