How do I make a case-insensitive, partial text search engine that uses regex with MongoDB and PHP?

Question

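I'm trying to improve the search bar in my application. Right now, if a user types "Titan" into the search bar, the application retrieves the movie "Titanic" from MongoDB, because I use the following regex query: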

require 'dbconnection.php';
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    $input = $_REQUEST['input'];
    $query = $collection->find(['movie' => new MongoDB\BSON\Regex($input)]);
}

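I can also make the collection case-insensitive by creating it with a collation and adding the following index in the Mongo shell, so that if a user types "tiTAnIc" into the search bar, the application retrieves the movie "Titanic" from MongoDB: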

db.createCollection("c1", { collation: { locale: 'en_US', strength: 2 } } )
db.c1.createIndex( { movie: 1 } )

However, I can't combine the two features. The index above only removes case sensitivity when I change the query to the following:

$query=$collection->find( [ 'movie' => $input] );

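If I use the regex query from the top together with the collation index, the regex part is ignored: if I type "Titan", nothing is retrieved, but if I type "Titanic", it successfully retrieves "Titanic" (because "Titanic" is the exact word stored in my database).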

Any suggestions?

Answer 1

Note: regex searches on an indexed field affect performance, as stated in the $regex docs:

Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.

Your problem is that MongoDB relies on a prefix in the $regex (for example, /^acme/) to look up the index.

For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.

So you need to change it to this:

$query=$collection->find(['movie' => new MongoDB\BSON\Regex('^'.$input, 'i')]);
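One thing the query above does not handle is regex metacharacters in the user's input: a search for ".*" would be interpreted as a pattern rather than as literal text. A minimal sketch of one way to deal with that, assuming PHP's preg_quote() is an acceptable way to escape input for MongoDB's PCRE-based regex engine. The helper name buildSearchPattern is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper (not from the original answer): turns raw user input
// into a pattern string suitable for MongoDB\BSON\Regex.
function buildSearchPattern(string $input, bool $prefixOnly = true): string
{
    // preg_quote() escapes PCRE metacharacters such as . * + ? ( ) [ ]
    // so the input is matched as literal text.
    $escaped = preg_quote($input);

    // Anchor at the start for a prefix search; omit the anchor to match
    // the text anywhere in the field.
    return ($prefixOnly ? '^' : '') . $escaped;
}

// Usage with the driver (assumes $collection comes from dbconnection.php):
// $regex = new MongoDB\BSON\Regex(buildSearchPattern($input), 'i');
// $query = $collection->find(['movie' => $regex]);
```

Passing false as the second argument drops the ^ anchor, which gives a "contains" search at the cost of the prefix optimization described in the quoted documentation.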

I suggest you design your collection more carefully.
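As one possible reading of that advice (an assumption, not something spelled out here), the collection could store a lowercased copy of the title next to the original, so that a case-sensitive prefix regex, which the quoted documentation says can use an index, is enough. The field name movie_lower and the helper names below are hypothetical, and an ordinary index with simple binary comparison on that field is assumed:

```php
<?php
// Hypothetical field name; the original collection only has "movie".
const SEARCH_FIELD = 'movie_lower';

// Lowercase a string, using mbstring when available so non-ASCII titles work.
function normalizeForSearch(string $text): string
{
    return function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);
}

// Document to insert: keeps the display title and adds the normalized copy.
function buildMovieDocument(string $title): array
{
    return ['movie' => $title, SEARCH_FIELD => normalizeForSearch($title)];
}

// Filter for a case-sensitive prefix regex on the normalized field.
function buildPrefixFilter(string $input): array
{
    $pattern = '^' . preg_quote(normalizeForSearch($input));
    return [SEARCH_FIELD => ['$regex' => $pattern]];
}

// Usage (assumes $collection comes from dbconnection.php):
// $collection->insertOne(buildMovieDocument('Titanic'));
// $query = $collection->find(buildPrefixFilter($input));
```

With this layout the query no longer needs the 'i' flag, because both the stored value and the search input are lowercased before comparison.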

php

regex

search-engine

case-insensitive

mongodb