Starts with a word or ends with a word using Hibernate Search
I am using Hibernate Search with spring-boot. I need to give users search operators that do the following on the establishment name:
- Starts with a word
.Ali --> means the phrase should strictly start with Ali, so AlAli should not be returned in the results
query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
.matching(term + "*").createQuery();
It returns mixed results containing the term in the middle, at the start, or at the end, which does not meet the requirement above.
- Ends with a word
Kamran. --> means the phrase should strictly end with Kamran, so Kamranullah should not be returned in the results
query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
.matching("*"+term).createQuery();
As per the documentation, it's not a good idea to put "*" at the start. My question is: how can I achieve the expected result?
My domain class and analyzer:
@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {
@Analyzer(definition = "english")
@Field(store = Store.YES)
@Column(name = "ESTABLISHMENT_NAME_EN")
private String establishmentNameEn;
// getter and setter
}
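There are two problems here:
Tokenizing
You are using a tokenizer, which means your search works on words, not on the full string you indexed. This explains why you get matches on terms in the middle of a sentence.
This can be solved by creating a separate field for these special begin/end queries and using an analyzer with the KeywordTokenizer (which is a no-op: it keeps the whole value as a single token).
For example: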
@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@AnalyzerDef(name = "english_beginEnd", tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class), filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {
@Analyzer(definition = "english")
@Field(store = Store.YES)
@Field(name = "establishmentNameEn_beginEnd", store = Store.YES, analyzer = @Analyzer(definition = "english_beginEnd"))
@Column(name = "ESTABLISHMENT_NAME_EN")
private String establishmentNameEn;
// getter and setter
}
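To make the tokenizing problem concrete, here is a minimal plain-Java sketch. It does not use Hibernate Search or Lucene; `wordTokens` and `keywordToken` are hypothetical stand-ins that only approximate what a word tokenizer and the KeywordTokenizer (each followed by lowercasing) would put in the index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class TokenizationSketch {

    // Rough stand-in for a word tokenizer plus lowercasing:
    // split on anything that is not a letter or digit.
    static List<String> wordTokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("[^\\p{L}\\p{N}]+"));
    }

    // Rough stand-in for KeywordTokenizer plus lowercasing:
    // the whole value is kept as one single token.
    static List<String> keywordToken(String text) {
        return Collections.singletonList(text.toLowerCase(Locale.ROOT));
    }

    // A prefix wildcard such as "ali*" matches a document
    // if ANY of its indexed tokens starts with the prefix.
    static boolean prefixMatches(List<String> tokens, String lowercasedPrefix) {
        return tokens.stream().anyMatch(t -> t.startsWith(lowercasedPrefix));
    }

    public static void main(String[] args) {
        String name = "Kamran Ali Trading";
        // Word tokens: "ali" is a token of its own, so a mid-name match is found.
        System.out.println(prefixMatches(wordTokens(name), "ali"));      // true
        // Single keyword token: only the start of the full name can match.
        System.out.println(prefixMatches(keywordToken(name), "ali"));    // false
        System.out.println(prefixMatches(keywordToken(name), "kamran")); // true
    }
}
```

With the single-token field, a prefix can only match the beginning of the whole name, which is the behavior the question asks for.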
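Query analysis and performance
Wildcard queries do not trigger analysis of the input text. This will lead to unexpected behavior. For example, if you index "Ali" and then search for "ali", you will get a result, but if you search for "Ali", you won't: the text was analyzed and indexed as "ali", which does not exactly match "Ali".
Additionally, as you are aware, leading wildcards perform very, very badly.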
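As a rough illustration of the analysis issue, the plain-Java sketch below mimics an index that stores lowercased values and a prefix pattern that is compared as-is. The class and method names are hypothetical, and the matcher only handles a single trailing "*"; the point is simply that the term has to be lowercased by hand before it goes into a wildcard pattern.

```java
import java.util.Locale;

final class WildcardTermSketch {

    // Stand-in for index-time analysis of the keyword field: lowercasing only.
    static String indexedForm(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Builds a prefix pattern, lowercasing manually because
    // wildcard queries skip analysis of the input text.
    static String prefixPattern(String userInput) {
        return userInput.trim().toLowerCase(Locale.ROOT) + "*";
    }

    // Minimal matcher for a pattern that ends with exactly one "*".
    static boolean matchesPrefixPattern(String indexedTerm, String pattern) {
        String prefix = pattern.substring(0, pattern.length() - 1);
        return indexedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        String indexed = indexedForm("Ali Trading"); // stored as "ali trading"
        // Raw user input keeps its capital letter and misses the indexed term.
        System.out.println(matchesPrefixPattern(indexed, "Ali" + "*"));         // false
        // Lowercasing first makes the pattern line up with the index.
        System.out.println(matchesPrefixPattern(indexed, prefixPattern("Ali"))); // true
    }
}
```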
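If your field is of reasonable length (say, fewer than 30 characters), I would recommend using an "edge-ngram" analyzer instead; you'll find an explanation here:
Note that you still need to use the KeywordTokenizer (unlike the example I linked).
This will take care of the "match the beginning of the text" query, but not the "match the end of the text" query.
To address that second query, I would create a separate field and a separate analyzer, similar to the one used for the first query, the only difference being that you insert a ReverseStringFilterFactory before the edge-ngram filter. This will reverse the text before indexing the ngrams, which should lead to the desired behavior. Don't forget to also use a separate query analyzer for this field, one that reverses the string.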
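The idea behind the two edge-ngram fields can be sketched in plain Java as follows. This is only an illustration of the technique, not Hibernate Search or Lucene code: the `EdgeNGramSketch` class, its method names, and the gram sizes are all assumptions made for the example. Each name is lowercased, kept as one token, and expanded into its prefixes; for the "ends with" field the token is reversed first, and the query term is reversed as well.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class EdgeNGramSketch {

    // All prefixes of the token with lengths from minGram to maxGram.
    static Set<String> edgeNGrams(String token, int minGram, int maxGram) {
        Set<String> grams = new LinkedHashSet<>();
        int limit = Math.min(maxGram, token.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // "Starts with" field: lowercase, single token, edge n-grams.
    static Set<String> indexForStartsWith(String value, int maxGram) {
        return edgeNGrams(value.toLowerCase(Locale.ROOT), 1, maxGram);
    }

    // "Ends with" field: same, but the token is reversed before the n-grams.
    static Set<String> indexForEndsWith(String value, int maxGram) {
        return edgeNGrams(reverse(value.toLowerCase(Locale.ROOT)), 1, maxGram);
    }

    // Query side: lowercase only (no n-grams), then an exact term lookup.
    static boolean startsWith(Set<String> indexed, String term) {
        return indexed.contains(term.toLowerCase(Locale.ROOT));
    }

    // Query side for the reversed field: lowercase and reverse the term.
    static boolean endsWith(Set<String> indexedReversed, String term) {
        return indexedReversed.contains(reverse(term.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(startsWith(indexForStartsWith("Ali Trading", 30), "Ali"));       // true
        System.out.println(startsWith(indexForStartsWith("AlAli Trading", 30), "Ali"));     // false
        System.out.println(endsWith(indexForEndsWith("Trading Kamran", 30), "Kamran"));      // true
        System.out.println(endsWith(indexForEndsWith("Trading Kamranullah", 30), "Kamran")); // false
    }
}
```

Because the prefixes are computed at indexing time, the query becomes a plain term lookup and no wildcard is needed, which is why this approach avoids the leading-wildcard cost. The trade-off is a larger index, hence the advice to use it only on reasonably short fields.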