Lucene 6 - how to intercept tokenizing when writing an index?

This question says look at this question... but unfortunately, those clever people's solution no longer seems to apply in Lucene 6, since the signature of createComponents is now:

TokenStreamComponents createComponents(final String fieldName)...

A Reader is no longer supplied.

Does anyone know what the technique should be now? Are we meant to make the Reader a field of the Analyzer class?

NB I don't actually want to filter anything out; I want to get hold of the token stream so I can build my own data structures (for frequency analysis and sequence matching). So the idea is to use Lucene's Analyzer machinery to produce different models of the corpus. A simple example: in one model everything is lower-cased; in another, case is preserved as it appears in the corpus.
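The "different models" idea can be sketched without any Lucene types: an Analyzer chain is essentially a transform applied to each token before you count it. A minimal pure-Java sketch (the names here are illustrative, not Lucene API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class CorpusModels {
    // A "model" is just a per-token transform applied before counting.
    static Map<String, Integer> frequencies(String text, UnaryOperator<String> transform) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) continue;
            freq.merge(transform.apply(token), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String text = "Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall";
        // Model 1: everything lower-cased.
        Map<String, Integer> lower = frequencies(text, String::toLowerCase);
        // Model 2: case preserved as in the corpus.
        Map<String, Integer> asIs = frequencies(text, UnaryOperator.identity());
        System.out.println(lower.get("humpty")); // 2
        System.out.println(asIs.get("Humpty"));  // 2
    }
}
```

In Lucene the transform step would be a TokenFilter in the chain (e.g. LowerCaseFilter present or absent), but the shape of the problem is the same.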

PS I also saw this question: but once again a Reader has to be supplied: i.e. I assume the context there is tokenising for query purposes. When indexing, although the Analyzers in earlier versions evidently got hold of a Reader from somewhere by the time createComponents was called, you don't yet have a Reader (as far as I know...).

Got it — as in the referenced question, the trick is again to override the Analyzer's createComponents.

So here is my tampered-with version of EnglishAnalyzer:

private int nTerm = 0; // field added by me

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);

    // my modification starts here:
    class ExamineFilter extends FilteringTokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        ExamineFilter(TokenStream in) {
            super(in);
        }

        @Override
        protected boolean accept() throws IOException {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            printOut(String.format("# term %d |%s|", nTerm, term)); // printOut is my own logging helper

            // do all sorts of things with this term...

            nTerm++;
            return true; // accept every token: we only observe, we never filter anything out
        }
    }
    class MyTokenStreamComponents extends TokenStreamComponents {
        MyTokenStreamComponents(Tokenizer source, TokenStream result) {
            super(source, result);
        }

        @Override
        public TokenStream getTokenStream() {
            // reset the term count at the start of each Document
            nTerm = 0;
            return super.getTokenStream();
        }
    }
    result = new ExamineFilter(result);
    return new MyTokenStreamComponents(source, result);
}

As a result, with input:

    String[] contents = { "Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall.", ... 

the output is just what I wanted:

# term 0 |humpti|
# term 1 |dumpti|
# term 2 |sat|
# term 3 |wall|

# term 0 |humpti|
# term 1 |dumpti|
# term 2 |had|
# term 3 |great|
# term 4 |fall|

# term 0 |all|
# term 1 |king|
# term 2 |hors|
# term 3 |all|
# term 4 |king|
# term 5 |men|

...
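The per-document counter reset done by MyTokenStreamComponents is just state shared between the components factory and the filter. The same shape in plain Java, with illustrative names rather than Lucene types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResetDemo {
    private int nTerm = 0; // shared counter, reset for each "document"

    // Mirrors ExamineFilter: observes each token and bumps the counter.
    List<String> examine(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(String.format("# term %d |%s|", nTerm, t));
            nTerm++;
        }
        return out;
    }

    // Mirrors MyTokenStreamComponents.getTokenStream(): reset at document start.
    List<String> processDocument(List<String> tokens) {
        nTerm = 0;
        return examine(tokens);
    }

    public static void main(String[] args) {
        ResetDemo d = new ResetDemo();
        for (String line : d.processDocument(Arrays.asList("humpti", "dumpti", "sat", "wall")))
            System.out.println(line);
        System.out.println();
        for (String line : d.processDocument(Arrays.asList("humpti", "dumpti", "had")))
            System.out.println(line);
    }
}
```

The counter restarts at 0 for the second document, just as in the output above. In the real Analyzer the reset lives in getTokenStream() because that is called once per field per document during indexing.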