This Regex is not working only in Solr

Question

This regex works perfectly in a plain C# console application. Based on this, we started using SolrNet. However, querying a field of the Solr instance with the same regex throws the exception shown below:

java.lang.IllegalArgumentException: expected ']' at position 70 at org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1087)

Answer 1

You are using the Lucene regex engine, which is different from the .NET regex engine.

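In a Lucene pattern, a hyphen is treated as a range operator even when it appears unescaped at the end of a character class. So either escape the hyphen or move it to the start of the character class, i.e. [a-zA-Z'-] => [-a-zA-Z'] and [^a-zA-Z'-] => [^-a-zA-Z'].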

It also looks like Lucene regex does not support non-capturing groups, so remove every ?: from the pattern.

So the pattern would look like this:

([-a-zA-Z']+[^-a-zA-Z']+){0,5}the([^-a-zA-Z']+[-a-zA-Z']+){0,5}
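The two changes described in this answer (moving a trailing hyphen to the front of each character class and dropping ?:) can be applied mechanically. The following is only a rough sketch in Python, not part of Lucene or SolrNet: the helper name to_lucene_regex is hypothetical, the input pattern is reconstructed by reversing the two changes (an assumption, since the question's original pattern is not shown here), and the helper assumes simple character classes without nested brackets or an escaped ']'.

```python
import re

# Matches a character class whose last character is an unescaped hyphen,
# e.g. [a-zA-Z'-] or [^a-zA-Z'-].
# Assumption: classes contain no nested brackets and no escaped ']'.
_TRAILING_HYPHEN_CLASS = re.compile(r"\[(\^?)([^\]]*[^\]\\])-\]")


def to_lucene_regex(pattern: str) -> str:
    """Hypothetical helper: apply the two rewrites described in the answer."""
    # 1. Turn non-capturing groups into plain groups: (?:...) becomes (...)
    pattern = pattern.replace("(?:", "(")
    # 2. Move a trailing hyphen to the start of the character class,
    #    keeping a leading ^ (negation) in front of it.
    return _TRAILING_HYPHEN_CLASS.sub(r"[\1-\2]", pattern)


# Assumed original .NET-style pattern (reconstructed, not quoted from the question).
dotnet_pattern = r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}the(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}"
print(to_lucene_regex(dotnet_pattern))
# prints ([-a-zA-Z']+[^-a-zA-Z']+){0,5}the([^-a-zA-Z']+[-a-zA-Z']+){0,5}
```

The rewrite is idempotent: a pattern that already has the hyphen at the front of each class and no ?: passes through unchanged.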

Answer 2

Based on your comments, your use case seems best suited to a phrase query. Have you tried one?

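A query like "website stackoverflow.com is"~5 would work and perform better. If order matters, you can use two queries ("website stackoverflow"~5 and "stackoverflow.com is"~5) and use a custom scorer to drop matches that are out of order. That will be much more efficient.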
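As a rough illustration of the proximity syntax used in this answer, the sketch below builds such query strings in Python. The helper name phrase_query and the field name content are hypothetical, joining the two sub-queries with AND is just one possible reading of "use two queries", and the custom scorer mentioned above is not shown.

```python
def phrase_query(field: str, phrase: str, slop: int) -> str:
    """Hypothetical helper: build a sloppy phrase query such as field:"a b"~5."""
    # Escape backslashes and double quotes so the phrase stays inside the quotes.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"~{slop}'


# Single proximity query: the terms must occur within 5 positions of each other.
single = phrase_query("content", "website stackoverflow.com is", 5)

# Two narrower queries, combined here with AND (an assumption about how to pair them).
ordered = " AND ".join([
    phrase_query("content", "website stackoverflow", 5),
    phrase_query("content", "stackoverflow.com is", 5),
])

print(single)
print(ordered)
```

The resulting strings can be passed as the q parameter of a Solr request, or through whatever query API the client library exposes.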

regex

solr

solrnet