Starts with a word or ends with a word using Hibernate Search

I am using Hibernate Search with spring-boot. I have a requirement to let users apply search operators that do the following on the establishment name:

  1. Starts with a word

.Ali --> means the phrase should strictly start with Ali, so AlAli should not be returned in the results

query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
                        .matching(term + "*").createQuery();

It returns mixed results containing the term in the middle, at the start, or at the end, which does not meet the requirement above.

  2. Ends with a word

Kamran. --> means the phrase should strictly end with Kamran, so Kamranullah should not be returned in the results

query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
                        .matching("*"+term).createQuery();

As per the documentation, it's not a good idea to put "*" at the start. My question is: how can I achieve the expected result?

My domain class and analyzer:

@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class) })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {

    @Analyzer(definition = "english")
    @Field(store = Store.YES)
    @Column(name = "ESTABLISHMENT_NAME_EN")
    private String establishmentNameEn;

    // getter and setter
}

There are two problems here:

Tokenization

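You are using a tokenizer, which means your search operates on individual words, not on the full string you indexed. This explains why you get matches on terms in the middle of the name.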
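To make the effect of tokenization concrete, here is a minimal plain-Java sketch. It only imitates what the analyzers produce and is not Lucene code; the class and method names (`TokenizationSketch`, `wordTerms`, `keywordTerms`, `prefixMatches`) are made up for illustration, and real tokenizers split text by more elaborate rules than a regex.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Plain-Java imitation of two tokenization strategies (hypothetical helper, not Lucene). */
final class TokenizationSketch {

    /** Roughly what a word tokenizer plus lowercase filter yields: one term per word. */
    static List<String> wordTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }

    /** Roughly what a keyword tokenizer plus lowercase filter yields: the whole string as one term. */
    static List<String> keywordTerms(String text) {
        return Collections.singletonList(text.toLowerCase(Locale.ROOT));
    }

    /** A prefix wildcard such as "ali*" matches when ANY indexed term starts with the prefix. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With word terms, "Hassan Ali Trading" is indexed as three terms, so the prefix "ali" matches the middle word. With a single keyword term, the same prefix only matches names that actually begin with "ali".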

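This can be solved by creating a separate field for these special begin/end queries, using an analyzer with the KeywordTokenizer (which is a no-op: it keeps the whole string as a single token).

For example: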

@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class) })
@AnalyzerDef(name = "english_beginEnd", tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class), filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class) })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {

    @Analyzer(definition = "english")
    @Field(store = Store.YES)
    @Field(name = "establishmentNameEn_beginEnd", store = Store.YES, analyzer = @Analyzer(definition = "english_beginEnd"))
    @Column(name = "ESTABLISHMENT_NAME_EN")
    private String establishmentNameEn;

    // getter and setter
}

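Query analysis and performance

Wildcard queries do not trigger analysis of the input text. This leads to unexpected behavior. For example, if you index "Ali" and then search for "ali", you will get results, but if you search for "Ali" you will not: the text was analyzed and indexed as "ali", which does not exactly match "Ali".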
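Since the wildcard term is not analyzed, one workaround is to normalize the user input yourself before building the pattern. The sketch below assumes the only index-time normalization is lowercasing, as in the analyzers shown in this thread; `WildcardTerms` and its methods are hypothetical helper names, not part of Hibernate Search.

```java
import java.util.Locale;

/** Hypothetical helper that mimics the index-time lowercase filter for wildcard terms. */
final class WildcardTerms {

    /** Pattern for "starts with": lowercased input followed by a trailing wildcard. */
    static String startsWithPattern(String userInput) {
        return normalize(userInput) + "*";
    }

    /** Pattern for "ends with": a leading wildcard, which is expensive for the index to evaluate. */
    static String endsWithPattern(String userInput) {
        return "*" + normalize(userInput);
    }

    // Assumption: lowercasing is the only normalization the index analyzer applies.
    private static String normalize(String userInput) {
        return userInput.trim().toLowerCase(Locale.ROOT);
    }
}
```

The resulting string would then be passed to `.matching(...)` in the wildcard query, for example `.matching(WildcardTerms.startsWithPattern(term))`.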

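Additionally, as you are aware, leading wildcards perform very, very badly.

If your field has a reasonable length (say, fewer than 30 characters), I would recommend using an "edge-ngram" analyzer instead; you will find an explanation here:

Note that you will still need to use the KeywordTokenizer (unlike in the example I linked to).

This will handle the "match the beginning of the text" query, but not the "match the end of the text" query.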
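To illustrate why an edge n-gram index turns a "starts with" search into a plain term lookup, here is a minimal plain-Java sketch of the idea. It imitates a keyword tokenizer, a lowercase filter and an edge n-gram filter (in Lucene, `EdgeNGramFilterFactory` with its `minGramSize` and `maxGramSize` parameters); the class `EdgeNGramSketch` and its methods are made-up names, not a real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java imitation of edge n-gram indexing (hypothetical helper, not Lucene). */
final class EdgeNGramSketch {

    /** Index side: lowercase the whole name and emit every leading substring within the size bounds. */
    static List<String> indexTerms(String name, int minGram, int maxGram) {
        String normalized = name.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, normalized.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(normalized.substring(0, len));
        }
        return grams;
    }

    /** Query side: lowercase only (no n-grams), then look for an exact term match. */
    static boolean startsWith(List<String> indexedTerms, String userInput) {
        return indexedTerms.contains(userInput.toLowerCase(Locale.ROOT));
    }
}
```

Here "Ali Trading" is indexed as "a", "al", "ali", "ali " and so on, so a query for "Ali" finds the term "ali" without any wildcard. "AlAli Trading" never produces the term "ali", so it is not matched. The cost is a larger index, which is why this suits short fields.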

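To address the second query, I would create a separate field and a separate analyzer, similar to the one used for the first query, the only difference being that you insert a ReverseStringFilterFactory before the edge n-gram filter (EdgeNGramFilterFactory). This reverses the text before the n-grams are indexed, which should produce the desired behavior. Don't forget to use a separate query analyzer for this field as well, one that reverses the string.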
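The reversal trick can be sketched in plain Java as follows. This only imitates what a lowercase filter, a reverse filter and an edge n-gram filter would do together; `ReversedEdgeNGramSketch` and its methods are hypothetical names, and the minimum gram size is assumed to be 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java imitation of "reverse, then edge n-gram" indexing (hypothetical helper, not Lucene). */
final class ReversedEdgeNGramSketch {

    /** Shared by index and query side: lowercase, then reverse the whole string. */
    static String reverseLower(String text) {
        return new StringBuilder(text.toLowerCase(Locale.ROOT)).reverse().toString();
    }

    /** Index side: leading substrings of the reversed name, i.e. suffixes of the original. */
    static List<String> indexTerms(String name, int maxGram) {
        String reversed = reverseLower(name);
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, reversed.length());
        for (int len = 1; len <= longest; len++) {
            grams.add(reversed.substring(0, len));
        }
        return grams;
    }

    /** Query side: reverse the input the same way (no n-grams), then look for an exact term match. */
    static boolean endsWith(List<String> indexedTerms, String userInput) {
        return indexedTerms.contains(reverseLower(userInput));
    }
}
```

"Al Noor Kamran" is reversed to "narmak roon la", so the reversed query "narmak" is one of its indexed terms. "Kamranullah Store" reverses to "erots hallunarmak", which has no such leading term, so it is not matched.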