Lucene 6 - how to intercept tokenizing when writing an index?
This question says look at this question... but unfortunately those clever people's solution no longer seems to work in Lucene 6, because the signature of createComponents is now

TokenStreamComponents createComponents(final String fieldName)...

i.e. the Reader is no longer supplied.
Does anyone know what the technique is now supposed to be? Are we meant to make the Reader a field of the Analyzer class?
NB I don't actually want to filter anything: I want to get hold of the token stream so I can build my own data structures from it (for frequency analysis and sequence matching). So the idea is to use Lucene's Analyzer machinery to produce different models of the corpus. A simple example might be: one model in which everything is lower-cased, another in which the case is kept as it is in the corpus.
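A minimal sketch of that two-model idea, assuming Lucene 6 (the class names are my own invention; only the concept comes from the paragraph above):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Model 1: everything lower-cased
class LowerCasedModelAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}

// Model 2: case kept exactly as it appears in the corpus
class VerbatimCaseModelAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        return new TokenStreamComponents(source);
    }
}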
PS I have also seen this question: but there again one has to supply a Reader: i.e. I assume the context is tokenizing for the purposes of a query. When writing an index, although the Analyzers of earlier versions were evidently given a Reader from somewhere by the time createComponents was called, you don't yet have a Reader yourself (as far as I know...).
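It turns out that in Lucene 6 the Reader is handed over only when a stream is actually requested, via TokenStreamComponents.setReader, so createComponents builds the chain reader-free. A minimal consumption sketch, assuming StandardAnalyzer and a made-up field name "body":

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // the Reader only appears here, long after createComponents has run:
        try (TokenStream ts = analyzer.tokenStream("body",
                new StringReader("Humpty Dumpty sat on a wall"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}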
Got it: once again, as per the referenced question, the trick is to override the Analyzer's createComponents.
So here is my doctored version of EnglishAnalyzer:
// inside a copy of the Lucene 6 EnglishAnalyzer source:
private int nTerm = 0; // field added by me

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if (!stemExclusionSet.isEmpty())
        result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new PorterStemFilter(result);

    // my modification starts here: a pass-through filter that sees every term
    class ExamineFilter extends FilteringTokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public ExamineFilter(TokenStream in) {
            super(in);
        }

        @Override
        protected boolean accept() throws IOException {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            printOut(String.format("# term %d |%s|", nTerm, term)); // printOut: my logging helper
            // do all sorts of things with this term...
            nTerm++;
            return true; // never actually filter anything out
        }
    }

    class MyTokenStreamComponents extends TokenStreamComponents {
        MyTokenStreamComponents(Tokenizer source, TokenStream result) {
            super(source, result);
        }

        @Override
        public TokenStream getTokenStream() {
            // reset the term count at the start of each Document
            nTerm = 0;
            return super.getTokenStream();
        }
    }

    result = new ExamineFilter(result);
    return new MyTokenStreamComponents(source, result);
}
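For completeness, a sketch of the kind of indexing driver that produces the output below; IndexWriter requests a stream via getTokenStream() once per document, which is what resets nTerm. (MyEnglishAnalyzer stands for the doctored class above, and the field name is an assumption.)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        String[] contents = { "Humpty Dumpty sat on a wall,",
                              "Humpty Dumpty had a great fall." };
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(),
                new IndexWriterConfig(new MyEnglishAnalyzer()))) {
            for (String line : contents) {
                Document doc = new Document();
                doc.add(new TextField("body", line, Field.Store.NO));
                writer.addDocument(doc); // getTokenStream() fires here, per document
            }
        }
    }
}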
And the results, with input:
String[] contents = { "Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall.", ...
are great:
# term 0 |humpti|
# term 1 |dumpti|
# term 2 |sat|
# term 3 |wall|
# term 0 |humpti|
# term 1 |dumpti|
# term 2 |had|
# term 3 |great|
# term 4 |fall|
# term 0 |all|
# term 1 |king|
# term 2 |hors|
# term 3 |all|
# term 4 |king|
# term 5 |men|
...