How to use the NGram tokenizer in Lucene 5.0?
I want to generate character n-grams for a string. Below is the code I used for this with the Lucene 4.1 library.
Reader reader = new StringReader(text);
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 3, 5); // catch contiguous sequences of 3, 4 and 5 characters
CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    System.out.println(token);
}
However, I want to do this with Lucene 5.0.0. The NGramTokenizer in Lucene 5.0.0 has changed considerably compared with earlier versions; see http://lucene.apache.org/core/5_0_0/analyzers-common/index.html?org/apache/lucene/analysis/ngram/NGramTokenizer.html
Does anyone know how to generate n-grams with Lucene 5.0.0?
The following code:
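StringReader stringReader = new StringReader("abcd");
NGramTokenizer tokenizer = new NGramTokenizer(1, 2); // minGram = 1, maxGram = 2; the reader is no longer a constructor argument
tokenizer.setReader(stringReader);
tokenizer.reset(); // must be called before the first incrementToken()
CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class);
while (tokenizer.incrementToken()) {
    String token = termAtt.toString();
    System.out.println(token);
}
tokenizer.end();   // completes the TokenStream workflow
tokenizer.close(); // releases the underlying reader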
will produce:
a
ab
b
bc
c
cd
d
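The order of that output (all grams starting at the first character, then all grams starting at the second, and so on) can be reproduced without Lucene. What follows is only a plain-Java sketch of that ordering, not Lucene's implementation: the class name NGramSketch and the method ngrams are hypothetical, and it works on UTF-16 chars, so it is assumed to match the tokenizer only for text without surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // Hypothetical helper: returns every character n-gram whose length is
    // between minGram and maxGram, ordered by start offset, then by length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same input and gram sizes as the Lucene 5.0.0 snippet above.
        System.out.println(ngrams("abcd", 1, 2));   // [a, ab, b, bc, c, cd, d]
        // Gram sizes from the question (3 to 5) on a sample string.
        System.out.println(ngrams("abcdef", 3, 5)); // [abc, abcd, abcde, bcd, bcde, bcdef, cde, cdef, def]
    }
}
```

For the 3 to 5 character grams asked about in the question, the Lucene 5.0.0 snippet should only need new NGramTokenizer(3, 5) in place of new NGramTokenizer(1, 2).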