Change the "contents" field to stored, tokenized, and indexed for highlighting

Question

This is how I get the field details from LucenePDFDocument:

doc = LucenePDFDocument.getDocument(file);
System.out.println("field list: \n" + doc.getFields());

This is the output:

field list: 
[<stored<path:D:\Kuliah\rancangan document indexing\dir-pdf\dua.pdf>,
stored<url:D:/Kuliah/rancangan document indexing/dir-pdf/dua.pdf>,
stored,indexed,omitNorms,indexOptions=DOCS<modified:20170307220729>,
indexed,tokenized<uid:D Kuliah rancangan document indexing dir-pdf dua.pdf 20170307220729>, 
indexed,tokenized<contents:java.io.StringReader@4206a205>,
stored,indexed,tokenized<Author:Acer-2577>,
stored,indexed,tokenized<CreationDate:20150222074338>,
stored,indexed,tokenized<Creator:PDF24 Creator>,
stored,indexed,tokenized<ModificationDate:20150222074338>,
stored,indexed,tokenized<Producer:GPL Ghostscript 9.10>,
stored,indexed,tokenized<Title:Microsoft Word - Vol 10.1 bag ke 2a fix.doc>,
stored<summary:Jurnal Teknologi Informasi, Volume 10 Nomor 1, April ...>]

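I want to highlight the retrieved terms in the "contents" field. The highlighter needs a stored field, but the "contents" field is only indexed and tokenized. I get an error like this: "contents field is not stored".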

What should I do to make the "contents" field stored, tokenized, and indexed? Should I edit LucenePDFDocument.java? If so, which part?

Answer 1

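Yes, the contents field is indexed but not stored. That means it can be searched, but its text is not returned with the search results, so it cannot be used with the highlighter.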

You will need to modify the LucenePDFDocument class so that it stores that field. To do so, just pass a String instead of a Reader to the addTextField call:

String contents = writer.getBuffer().toString();
addTextField(document, "contents", contents);

You should probably also remove the "summary" field, since you don't need it if you are storing the full contents.
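
To illustrate why the String matters, here is a minimal sketch using only the JDK (no Lucene or PDFBox on the classpath). It assumes the extracted text ends up in a StringWriter, as in the snippet above. The class name ReaderVersusString, the helper drain, and the sample text are hypothetical and only for illustration. A Reader is a one-shot stream: once something has consumed it (as an analyzer would when tokenizing), there is no text left to keep as a stored value, whereas a String can be read as often as needed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

public class ReaderVersusString {

    /** Reads everything that is still left in the reader (hypothetical helper). */
    static String drain(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text that the PDF text extraction writes out.
        StringWriter writer = new StringWriter();
        writer.write("Jurnal Teknologi Informasi");

        String contents = writer.getBuffer().toString();
        Reader reader = new StringReader(contents);

        String firstPass = drain(reader);  // the single pass a consumer gets
        String secondPass = drain(reader); // the stream is already exhausted

        System.out.println("first pass:  " + firstPass);
        System.out.println("second pass is empty: " + secondPass.isEmpty());
        System.out.println("string is still available: " + contents);
    }
}
```

This is only an analogy for the behaviour described in the answer, not Lucene's actual implementation: the field list in the question shows the Reader-backed "contents" field as indexed and tokenized only, while the String-backed fields such as "Author" are also stored.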

lucene

pdfbox