How to use a pattern tokenizer to index only words that start with a capital letter in Lucene

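I am using Lucene 5.1.0, and I want my index writer to index only terms that start with a capital letter. I have looked into custom analyzers and the pattern tokenizer, but I don't understand how to use them to index only words that start with a capital letter (or that consist entirely of capital letters). Any help would be appreciated.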

I found this link helpful for understanding custom tokenizers/analyzers/filters: http://www.citrine.io/blog/2015/2/14/building-a-custom-analyzer-in-lucene

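In your case, however, I think it is easier to extend org.apache.lucene.analysis.util.FilteringTokenFilter rather than TokenFilter. You only have to implement accept(), and any token for which it returns false is dropped from the stream: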

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.FilteringTokenFilter;

public class StartsWithCapitalTokenFilter extends FilteringTokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StartsWithCapitalTokenFilter(TokenStream tokenStream) {
        super(tokenStream);
    }

    @Override
    public boolean accept() {
        // When accept() is called, my understanding is that termAtt holds the
        // token under consideration. buffer() can be longer than the token, so
        // only the first termAtt.length() chars are valid.
        if (termAtt.length() == 0) {
            return false; // nothing to inspect in an empty token
        }

        // Get the Unicode code point of the first character and check whether
        // it is uppercase.
        return Character.isUpperCase(
                Character.codePointAt(termAtt.buffer(), 0, termAtt.length()));

        // Or, if you don't care about code points above U+FFFF, use the below.
        //return Character.isUpperCase(termAtt.buffer()[0]);
    }
}
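To make the difference between the two checks in accept() concrete, here is a small stand-alone sketch that does not depend on Lucene. The class and method names (FirstCodePointCheck, startsWithCapital, startsWithCapitalCharOnly) are made up for illustration; only the java.lang.Character calls come from the JDK. It assumes "capital" means whatever Character.isUpperCase reports.

```java
public class FirstCodePointCheck {

    // Code point based check: correct for characters above U+FFFF as well.
    static boolean startsWithCapital(CharSequence term) {
        if (term.length() == 0) {
            return false;
        }
        return Character.isUpperCase(Character.codePointAt(term, 0));
    }

    // char based check: looks only at the first UTF-16 code unit, so it sees
    // a lone surrogate when the first character is above U+FFFF.
    static boolean startsWithCapitalCharOnly(CharSequence term) {
        return term.length() > 0 && Character.isUpperCase(term.charAt(0));
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is an uppercase letter
        // outside the Basic Multilingual Plane.
        String deseret = new String(Character.toChars(0x10400)) + "x";

        System.out.println(startsWithCapital("Lucene"));            // true
        System.out.println(startsWithCapital("lucene"));            // false
        System.out.println(startsWithCapital(deseret));             // true
        System.out.println(startsWithCapitalCharOnly(deseret));     // false
    }
}
```

For ordinary Latin text both checks give the same answer, so the simpler char version is usually enough unless your data contains supplementary-plane scripts.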

You will then need some kind of custom analyzer that uses the filter. This one uses only the new filter:

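import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StartswithCapitalAnalyzer extends Analyzer {
    // In Lucene 5.x, createComponents takes only the field name; the Reader
    // is handed to the tokenizer later by the framework.
    @Override
    protected TokenStreamComponents createComponents(String field) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream filter = new StartsWithCapitalTokenFilter(tokenizer);

        // Chain any other filters you want in here, like so. Note that a
        // lowercasing filter has to come after the capital-letter check.
        //filter = new LowerCaseFilter(filter);

        return new TokenStreamComponents(tokenizer, filter);
    }
}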
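Since the question asks about the pattern tokenizer specifically: if you would rather go that route, the core of it is a regular expression that matches only capitalized words. Below is a rough sketch of such a pattern using plain java.util.regex, so it can be tried without Lucene. The class name, the pattern itself, and the definition of "word" (a run of letters and digits) are my own assumptions. As far as I know, PatternTokenizer can be constructed with a Pattern and a group number, where group 0 emits each whole match as a token, but I have not tested this pattern inside Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapitalizedWordPattern {

    // An uppercase letter that is not preceded by a letter or digit,
    // followed by any number of letters or digits.
    static final Pattern CAPITALIZED_WORD =
            Pattern.compile("(?<![\\p{L}\\p{Nd}])\\p{Lu}[\\p{L}\\p{Nd}]*");

    // Collects every whole match, which is what a match-based tokenizer
    // using group 0 would emit as tokens.
    static List<String> capitalizedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = CAPITALIZED_WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [Hello, Foo, McDonald]
        System.out.println(capitalizedWords("Hello world Foo bar iPhone McDonald"));
    }
}
```

The lookbehind is what keeps "iPhone" from yielding "Phone". Keep in mind that this splits text differently from StandardTokenizer (apostrophes and hyphens, for example), so the two approaches will not always produce the same terms.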

I don't have an environment to test the filter and analyzer right now, but they should both work. Good luck!