How to ignore some chars in Lucene Query (Hibernate Search)
I have indexed this entity:
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed
public class MyBean {

    @Id
    private Long id;

    @Field
    private String foo;

    @Field
    private String bar;

    @Field
    private String baz;
}
For this schema:
+----+-------------+-------------+-------------+
| id | foo         | bar         | baz         |
+----+-------------+-------------+-------------+
| 11 | an example  | ignore this | ignore this |
| 12 | ignore this | an e.x.a.m. | ignore this |
| 13 | not this    | not this    | not this    |
+----+-------------+-------------+-------------+
I need to find both 11 and 12 by searching for exam.
I have tried:
FullTextEntityManager fullTextEntityManager =
        Search.getFullTextEntityManager(this.entityManager);
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder().forEntity(MyBean.class).get();
Query textQuery = queryBuilder.keyword()
        .onFields("foo", "bar", "baz").matching("exam").createQuery();
fullTextEntityManager.createFullTextQuery(textQuery, MyBean.class).getResultList();
But this only finds entity 11, and I need 12 as well. Is that possible?
Adding a WordDelimiterFilter with the CATENATE_ALL flag to your analysis chain could be a solution.
An analyzer implementation based on StandardAnalyzer would look something like this:
// Imports for Lucene 5.x (as bundled with Hibernate Search 5); package names differ in later Lucene versions.
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;

public class StandardWithWordDelim extends StopwordAnalyzerBase {

    public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

    public StandardWithWordDelim() {
        // Pass the stop words to the base class so the StopFilter below actually uses them.
        super(STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        src.setMaxTokenLength(255);
        TokenStream filter = new StandardFilter(src);
        filter = new LowerCaseFilter(filter);
        filter = new StopFilter(filter, stopwords);
        // I'm inclined to add it here, so the abbreviation "t.h.e." doesn't get whacked by the StopFilter.
        filter = new WordDelimiterFilter(filter, WordDelimiterFilter.CATENATE_ALL, null);
        return new TokenStreamComponents(src, filter);
    }
}
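If you want to sanity-check the chain before reindexing, a small sketch like the one below (not part of the original answer; the class name AnalyzerCheck is made up) prints the tokens the analyzer emits for the value that is currently not matching. With only CATENATE_ALL set, "an e.x.a.m." should come out as the single token exam, since "an" is dropped as a stop word:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerCheck {
    public static void main(String[] args) throws Exception {
        // Run the problematic field value through the custom analyzer and print each token.
        try (Analyzer analyzer = new StandardWithWordDelim();
             TokenStream ts = analyzer.tokenStream("bar", "an e.x.a.m.")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // expected: "exam"
            }
            ts.end();
        }
    }
}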
It doesn't look like you are using the standard analyzer (NGrams, maybe?), but you should be able to incorporate this into your analysis somewhere.
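For what it's worth, one way to wire the custom analyzer in, assuming Hibernate Search 5 and its @Analyzer annotation (a sketch only; if you already configure an NGram analyzer via @AnalyzerDef, you would add the WordDelimiterFilter to that definition instead):

import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed
@Analyzer(impl = StandardWithWordDelim.class) // class-level: applies to all @Field properties of this entity
public class MyBean {

    @Id
    private Long id;

    @Field
    private String foo;

    @Field
    private String bar;

    @Field
    private String baz;
}

Remember to rebuild the index (for example with the mass indexer) after changing the analyzer, since the change only affects documents indexed from that point on.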