Is there a way to get the word count from a Lucene index?

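I am using the latest Apache Lucene release, 8.1.1.

Is it possible, and if so how, to get the word count of every (non-stop-word) term stored in a Lucene index? The result should look like this: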

term1 453443
term2 445484
term3 443333

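and so on.

I need this in Java or Scala, but an example in any language that illustrates the API would be fine ...

You can find a sample implementation below.

Note that the counts it reports are document frequencies, not numbers of occurrences (for "lucene" the word count is 4, but the document count is 3). Stop words are not omitted either.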

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneTest2 {
    final static String index = "index";
    final static String field = "text";

    public static void index() {
        try {
            Directory dir = FSDirectory.open(Paths.get(index));
            Analyzer analyzer = new StandardAnalyzer();
            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

            iwc.setOpenMode(OpenMode.CREATE);

            IndexWriter writer = new IndexWriter(dir, iwc);

            String[] lines = {
                    "lucene java lucene mark",
                    "lucene",
                    "lucene an example",
                    "java python"
            };
            for (int i = 0; i < lines.length; i++) {
                String line = lines[i];
                Document doc = new Document();
                doc.add(new StringField("id", "" + i, Field.Store.YES));
                doc.add(new TextField(field, line.trim(), Field.Store.YES));
                writer.addDocument(doc);
            }

            System.out.println("indexed " + lines.length + " sentences");
            writer.close();
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }

    public static void count() {
        // try-with-resources closes the reader even if the lookup fails.
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)))) {
            int numTerms = 100;
            // DocFreqComparator ranks terms by the number of documents that contain them.
            TermStats[] stats = HighFreqTerms.getHighFreqTerms(reader, numTerms, field, new HighFreqTerms.DocFreqComparator());
            for (TermStats termStats : stats) {
                String termText = termStats.termtext.utf8ToString();
                System.out.println(termText + " " + termStats.docFreq);
            }
        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        index();
        count();
    }
}

This outputs:

lucene 3
java 2
python 1
mark 1
example 1
an 1
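
To make the difference between the two kinds of count concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It counts both total occurrences and document frequency over the same four sample sentences and skips stop words. The class name `TermCountSketch`, the whitespace tokenizer and the tiny stop-word list are assumptions made up for this illustration; they are not what `StandardAnalyzer` does internally. On the Lucene side, `HighFreqTerms` also appears to offer a `TotalTermFreqComparator`, and `TermStats` a `totalTermFreq` field, which is probably the route to occurrence counts from the index itself, but that variant is not shown or verified here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration (no Lucene): total occurrences vs. document frequency.
class TermCountSketch {
    // Same sample sentences as in the indexing code above.
    static final String[] LINES = {
            "lucene java lucene mark",
            "lucene",
            "lucene an example",
            "java python"
    };

    // Hypothetical stop-word list, chosen only for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Simplified tokenizer (assumption): lower-case, split on whitespace.
    static String[] tokenize(String line) {
        return line.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Counts every occurrence of each non-stop-word term.
    static Map<String, Integer> totalTermFreq(String[] lines, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Counts the number of lines ("documents") that contain each non-stop-word term.
    static Map<String, Integer> docFreq(String[] lines, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // De-duplicate tokens so each line contributes at most once per term.
            Set<String> distinct = new LinkedHashSet<>(Arrays.asList(tokenize(line)));
            for (String token : distinct) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> total = totalTermFreq(LINES, STOP_WORDS);
        Map<String, Integer> docs = docFreq(LINES, STOP_WORDS);
        // Prints: term, total occurrences, document frequency.
        for (String term : total.keySet()) {
            System.out.println(term + " " + total.get(term) + " " + docs.get(term));
        }
    }
}
```

For "lucene" this yields 4 occurrences across 3 documents, matching the note above, and "an" is dropped because it is in the example stop-word list.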