Lucene.net 4.8 unable to search with accent

Based on some help here on Stack Overflow, I managed to create a custom analyzer, but I still can't get searches to work for words that have an accent.

public class CustomAnalyzer : Analyzer
{
    LuceneVersion matchVersion;

    public CustomAnalyzer(LuceneVersion p_matchVersion) : base()
    {
        matchVersion = p_matchVersion;
    }
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new KeywordTokenizer(reader);
        TokenStream result = new StopFilter(matchVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);            
        result = new LowerCaseFilter(matchVersion, result); 
        result = new StandardFilter(matchVersion, result);
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(tokenizer, result);
       
    }
}

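The idea is to be able to search for "perez" and find "Pérez". Using that analyzer, I recreated the index and ran the search, but I still get no results for words with an accent.

As the LuceneVersion, I am using LuceneVersion.LUCENE_48.

Any help would be greatly appreciated. Thanks!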

Answered originally on GitHub, but copying here for context.

No, it is not valid to use multiple tokenizers in the same analyzer, because there are strict consuming rules to adhere to.

It would be great to build code analysis components to ensure developers adhere to these tokenizer rules while typing, such as the rule that ensures TokenStream classes are sealed or use a sealed IncrementToken() method (contributions welcome). It is not likely we will add any additional code analyzers prior to the 4.8.0 release unless they are contributed by the community, though, as these are not blocking the release. For the time being, the best way to ensure custom analyzers adhere to the rules is to test them with Lucene.Net.TestFramework, which also hits them with multiple threads, random cultures, and random strings of text to ensure they are robust.
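To make the "consuming rules" a bit more concrete, here is a minimal sketch of the consumer side of the contract: add attributes, call Reset(), loop on IncrementToken(), call End(), then dispose. The names FoldingAnalyzer, TokenStreamDemo, and the field name "body" are made up for illustration. I am also assuming ASCIIFoldingFilter from Lucene.Net.Analysis.Common here instead of the ICU filter, purely to keep the sketch to one package; note that ASCIIFoldingFilter only handles precomposed characters.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

// Hypothetical analyzer for illustration: one tokenizer, followed by filters only.
public sealed class FoldingAnalyzer : Analyzer
{
    private readonly LuceneVersion matchVersion;

    public FoldingAnalyzer(LuceneVersion matchVersion)
    {
        this.matchVersion = matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Exactly one tokenizer at the head of the chain.
        Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);

        // Filters wrap the tokenizer (or the previous filter).
        TokenStream result = new LowerCaseFilter(matchVersion, tokenizer);
        result = new ASCIIFoldingFilter(result);

        return new TokenStreamComponents(tokenizer, result);
    }
}

// Hypothetical helper that consumes a TokenStream in the required order.
public static class TokenStreamDemo
{
    public static IList<string> Tokenize(Analyzer analyzer, string text)
    {
        var terms = new List<string>();

        using (TokenStream ts = analyzer.GetTokenStream("body", new StringReader(text)))
        {
            // Attributes must be requested before the stream is consumed.
            ICharTermAttribute termAtt = ts.AddAttribute<ICharTermAttribute>();

            ts.Reset();                      // required before the first IncrementToken()
            while (ts.IncrementToken())
            {
                terms.Add(termAtt.ToString());
            }
            ts.End();                        // required after the last IncrementToken()
        }                                    // Dispose() releases the stream for reuse

        return terms;
    }
}
```

Calling TokenStreamDemo.Tokenize(new FoldingAnalyzer(LuceneVersion.LUCENE_48), "Carlos P\u00E9rez") should yield "carlos" and "perez". Printing the tokens this way is a quick sanity check on what an analyzer actually emits, but it is not a substitute for the test framework.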

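I built a demo showing how to set up tests on custom analyzers here: https://github.com/NightOwl888/LuceneNetCustomAnalyzerDemo (as well as showing how the above example fails the tests). The functioning analyzer just uses a WhitespaceTokenizer and ICUFoldingFilter. Of course, you may wish to add additional test conditions to ensure your custom analyzer meets your expectations, and then you can experiment with different tokenizers and add or rearrange filters until you find a solution that meets all of your requirements (as well as plays by Lucene's rules). And of course, you can later add further conditions as you discover issues.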

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Util;
using System.IO;

namespace LuceneExtensions
{
    public sealed class CustomAnalyzer : Analyzer
    {
        private readonly LuceneVersion matchVersion;

        public CustomAnalyzer(LuceneVersion matchVersion)
        {
            this.matchVersion = matchVersion;
        }

        protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
        {
            // Tokenize...
            Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);
            TokenStream result = tokenizer;

            // Filter...
            result = new ICUFoldingFilter(result);

            // Return result...
            return new TokenStreamComponents(tokenizer, result);
        }
    }
}

using Lucene.Net.Analysis;
using NUnit.Framework;

namespace LuceneExtensions.Tests
{
    public class TestCustomAnalyzer : BaseTokenStreamTestCase
    {
        [Test]
        public virtual void TestRemoveAccents()
        {
            Analyzer a = new CustomAnalyzer(TEST_VERSION_CURRENT);

            // removal of latin accents (composed)
            AssertAnalyzesTo(a, "résumé", new string[] { "resume" });

            // removal of latin accents (decomposed)
            AssertAnalyzesTo(a, "re\u0301sume\u0301", new string[] { "resume" });

            // removal of latin accents (multi-word)
            AssertAnalyzesTo(a, "Carlos Pírez", new string[] { "carlos", "pirez" });
        }
    }
}

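For other ideas about test conditions you may use, I suggest having a look at Lucene.Net's extensive analyzer tests including the ICU tests. You may also refer to the tests to see if you can find a use case similar to the query you are building (although do note that the tests don't show .NET best practices for disposing objects).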
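To tie this back to the original goal (search "perez", find "Pérez"), here is a rough end-to-end sketch that indexes one document and queries it with the same analyzer on both sides. It assumes the Lucene.Net.ICU and Lucene.Net.QueryParser packages are referenced. AccentFoldingAnalyzer, SearchDemo, CountHits, and the field name "name" are hypothetical names chosen for this example, and the analyzer simply repeats the WhitespaceTokenizer plus ICUFoldingFilter chain shown above.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Icu;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical analyzer: same chain as the CustomAnalyzer in this answer.
public sealed class AccentFoldingAnalyzer : Analyzer
{
    private readonly LuceneVersion matchVersion;

    public AccentFoldingAnalyzer(LuceneVersion matchVersion)
    {
        this.matchVersion = matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);
        TokenStream result = new ICUFoldingFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, result);
    }
}

public static class SearchDemo
{
    // Indexes a single document in memory and returns the number of hits for queryText.
    public static int CountHits(string indexedText, string queryText)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        using (Analyzer analyzer = new AccentFoldingAnalyzer(version))
        using (RAMDirectory dir = new RAMDirectory())
        {
            // Index time: the analyzer folds accents before terms are written.
            var config = new IndexWriterConfig(version, analyzer);
            using (var writer = new IndexWriter(dir, config))
            {
                var doc = new Document();
                doc.Add(new TextField("name", indexedText, Field.Store.YES));
                writer.AddDocument(doc);
                writer.Commit();
            }

            // Query time: the SAME analyzer must be used so the query terms fold identically.
            using (DirectoryReader reader = DirectoryReader.Open(dir))
            {
                var searcher = new IndexSearcher(reader);
                var parser = new QueryParser(version, "name", analyzer);
                Query query = parser.Parse(queryText);
                return searcher.Search(query, 10).TotalHits;
            }
        }
    }
}
```

With this setup, SearchDemo.CountHits("Carlos Pérez", "perez") is expected to return 1. If you get 0 in your own application, the usual suspects are an index that was built with a different analyzer, or a query that is analyzed differently from the indexed field.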