MongoDB diacritic-insensitive text search does not return all accented rows (words with diacritical marks) as expected, and vice versa
I have a collection of documents with the following structure
uid, name
with an index
db.Collection.createIndex({name: "text"})
It contains the following data
1, iphone
2, iphóne
3, iphonë
4, iphónë
When I run a text search for iphone
I get only two records, which is unexpected
Actual output
--------------
1, iphone
2, iphóne
If I search for iphonë
db.Collection.find( { $text: { $search: "iphonë"} } );
I am getting
---------------------
3, iphonë
4, iphónë
But I actually expect the following output from either query
db.Collection.find( { $text: { $search: "iphone"} } );
db.Collection.find( { $text: { $search: "iphónë"} } );
Expected output
------------------
1, iphone
2, iphóne
3, iphonë
4, iphónë
Am I missing something here?
How can I get the expected output when searching for iphone or iphónë?
As of MongoDB 3.2, text indexes are diacritic insensitive:
With version 3, text index is diacritic insensitive. That is, the
index does not distinguish between characters that contain diacritical
marks and their non-marked counterpart, such as é, ê, and e. More
specifically, the text index strips the characters categorized as
diacritics in Unicode 8.0 Character Database Prop List.
So the following queries should work:
db.Collection.find( { $text: { $search: "iphone"} } );
db.Collection.find( { name: { $regex: "iphone"} } );
But there appears to be a bug with the diaeresis ( ¨ ), even though it is categorized as a diacritic in the Unicode 8.0 list (issue on JIRA: SERVER-29918)
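To make the quoted behaviour concrete, here is a minimal sketch in plain JavaScript of what "stripping diacritics" means. It is only an approximation: `stripDiacritics` is a hypothetical helper, and it uses Unicode NFD decomposition plus the general "mark" category rather than MongoDB's own diacritic table, so it is not the server's implementation. Under this approximation the diaeresis is folded the same way as the acute accent, which is the behaviour one would expect from the text index.

```javascript
// Hypothetical helper, not part of MongoDB: approximates diacritic folding.
function stripDiacritics(text) {
  // NFD splits a precomposed letter such as "é" into "e" + U+0301
  // (combining acute accent); \p{M} then matches and removes the marks.
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

// The four names from the question, written with escapes to avoid
// any ambiguity about precomposed versus decomposed characters.
const names = [
  "iphone",
  "iph\u00f3ne",       // iphóne
  "iphon\u00eb",       // iphonë
  "iph\u00f3n\u00eb",  // iphónë
];

const folded = names.map(stripDiacritics);
console.log(folded);
```

If the text index folded all four names like this, a search for either iphone or iphónë would match every document.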
Solution
Since MongoDB 3.4, you can use collation to perform this kind of query.
For example, to get the expected output, run the following query:
db.Collection.find({name: "iphone"}).collation({locale: "en", strength: 1})
This will output:
{ "_id" : 1, "name" : "iphone" }
{ "_id" : 2, "name" : "iphóne" }
{ "_id" : 3, "name" : "iphonë" }
{ "_id" : 4, "name" : "iphónë" }
In the collation, strength is the level of comparison to perform:
- 1: base characters only
- 2: diacritic sensitive
- 3: case sensitive + diacritic sensitive
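The three levels can be tried out without a database. The sketch below uses JavaScript's built-in `Intl.Collator`, whose `sensitivity` option offers comparable levels. Treating `"base"`, `"accent"` and `"variant"` as rough counterparts of strength 1, 2 and 3 is an assumption made for illustration; it is an analogy, not MongoDB's collation engine.

```javascript
// Assumed rough mapping: base ~ strength 1, accent ~ strength 2,
// variant ~ strength 3. Illustrative only, not MongoDB itself.
const base = new Intl.Collator("en", { sensitivity: "base" });
const accent = new Intl.Collator("en", { sensitivity: "accent" });
const variant = new Intl.Collator("en", { sensitivity: "variant" });

const plain = "iphone";
const accented = "iph\u00f3n\u00eb"; // iphónë
const upper = "IPHONE";

// compare() returns 0 when the collator treats two strings as equal.
const results = {
  baseIgnoresDiacritics: base.compare(plain, accented) === 0,
  baseIgnoresCase: base.compare(plain, upper) === 0,
  accentSeesDiacritics: accent.compare(plain, accented) !== 0,
  accentIgnoresCase: accent.compare(plain, upper) === 0,
  variantSeesCase: variant.compare(plain, upper) !== 0,
};
console.log(results);
```

Only the lowest level treats iphone and iphónë as equal, which is why the query above passes strength: 1.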