How do I create a Lucene index when the documents include scanned images, among other file types?

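My database stores resumes in a blob data field. A resume may be a Microsoft Word document, a PDF, or an image (.jpg, etc.). How can we build a Lucene index from these different file types, especially the .jpg files? Can Tika read scanned images?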

When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.

See the Apache Tika documentation on image formats: https://tika.apache.org/1.20/formats.html#Image_formats
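For what it's worth, here is a rough sketch of how the pieces could fit together: Tika's `AutoDetectParser` extracts text from each blob (handing images to Tesseract when it is available), and the text goes into a Lucene document. The class name `ResumeIndexer`, the field names `id` and `content`, the OCR language `eng`, and reading the blob from a file in `main` are all my own assumptions for illustration, not something from your setup. It also assumes Tika 1.x with `tika-parsers` and a Lucene 5+ release on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

public class ResumeIndexer {

    /** Extracts plain text from one resume blob (Word, PDF, or image). */
    public static String extractText(byte[] blob) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default limit on the number of extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        // Assumption: the scans are in English. This config is only used
        // when the Tesseract OCR parser actually handles the stream.
        TesseractOCRConfig ocr = new TesseractOCRConfig();
        ocr.setLanguage("eng");
        context.set(TesseractOCRConfig.class, ocr);

        try (InputStream in = new ByteArrayInputStream(blob)) {
            parser.parse(in, handler, metadata, context);
        }
        return handler.toString();
    }

    /** Adds one resume to the index. Field names are hypothetical. */
    public static void index(IndexWriter writer, String resumeId, byte[] blob)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", resumeId, Field.Store.YES));
        doc.add(new TextField("content", extractText(blob), Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        // args[0]: index directory; args[1]: a file standing in for one blob.
        byte[] blob = Files.readAllBytes(Paths.get(args[1]));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer =
                new IndexWriter(FSDirectory.open(Paths.get(args[0])), config)) {
            index(writer, args[1], blob);
        }
    }
}
```

In a real application you would loop over the rows of your resume table and pass each blob's bytes and primary key to `index`. Note that OCR only happens if the `tesseract` executable is installed where Tika can find it; otherwise a .jpg will yield metadata but little or no body text.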