Lucene on Maven - java.lang.IllegalArgumentException UTF8 encoding is longer than the max length 32766 error
I'm trying to use Lucene (pulled in via Maven) to index a large document whose content exceeds the term length limit, and I get this error:
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[65, 32, 98, 101, 110, 122, 111, 100, 105, 97, 122, 101, 112, 105, 110, 101, 32, 91, 116, 112, 108, 93, 73, 80, 65, 99, 45, 101, 110, 124]...', original message: bytes can be at most 32766 in length; got 85391
The code is below (copied from http://lucenetutorial.com/lucene-in-5-minutes.html, slightly modified to read the document from a file):
File file = new File("doc.txt");
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
Scanner scanner = new Scanner(file);
while (scanner.hasNextLine())
{
    String line = scanner.nextLine();
    doc.add(new StringField("content", line, Field.Store.YES));
    w.addDocument(doc);
}
...
There are other posts describing the same problem I'm running into, but they give solutions for Solr or Elasticsearch rather than plain Lucene on Maven, so I'm not sure how to fix this.
Can anyone point me to the right place to fix this?
Thanks in advance.
If you want to index text rather than single words, you should use something that breaks the text into words, such as a WhitespaceAnalyzer.
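To expand on that: the analyzer only matters for analyzed field types. StringField is indexed as a single, un-analyzed token, so each whole line becomes one term regardless of which analyzer is configured; switching to TextField is what actually routes the content through the analyzer. Below is a minimal sketch of a working version (keeping the question's StandardAnalyzer, which also tokenizes on word boundaries; the class name and the numDocs printout at the end are additions for illustration, and RAMDirectory is assumed to be available, i.e. a Lucene version before its removal):

```java
import java.io.File;
import java.util.Scanner;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexLargeDoc {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, new IndexWriterConfig(analyzer));

        Document doc = new Document();
        try (Scanner scanner = new Scanner(new File("doc.txt"))) {
            while (scanner.hasNextLine()) {
                // TextField is analyzed: each line is split into individual
                // terms, so no single term hits the 32766-byte limit (unless
                // one unbroken token is itself that long).
                doc.add(new TextField("content", scanner.nextLine(), Field.Store.YES));
            }
        }
        w.addDocument(doc); // add the document once, after all its fields
        w.close();

        try (DirectoryReader reader = DirectoryReader.open(index)) {
            System.out.println("indexed docs: " + reader.numDocs());
        }
    }
}
```

Note the original code also called w.addDocument(doc) inside the loop, re-adding the same growing document on every line; moving it after the loop indexes the file as one document.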