Tika Bridge is deprecated in Hibernate Search 6. Alternatives?
In Hibernate Search 6, the Apache Tika bridge is gone:
https://docs.jboss.org/hibernate/search/6.0/migration/html_single/#tikabridge
What is the best way now to index the content of PDF or Word document files? Are there any alternatives?
You can write your own bridge, as documented here.
Something like this:
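import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.hibernate.search.mapper.pojo.bridge.ValueBridge;
import org.hibernate.search.mapper.pojo.bridge.runtime.ValueBridgeToIndexedValueContext;
import org.xml.sax.SAXException;

// Treats the property value as a file path and indexes the text that Tika extracts from that file.
public class TikaBridge implements ValueBridge<String, String> {
    private final Parser parser;
    public TikaBridge() {
        parser = new AutoDetectParser();
    }
    @Override
    public String toIndexedValue(String documentPath, ValueBridgeToIndexedValueContext context) {
        if (documentPath == null) {
            return null;
        }
        try (InputStream input = Files.newInputStream(Paths.get(documentPath))) {
            StringWriter writer = new StringWriter();
            WriteOutContentHandler contentHandler = new WriteOutContentHandler(writer);
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            parser.parse(input, contentHandler, metadata, parseContext);
            return writer.toString();
        } catch (IOException | SAXException | TikaException e) {
            // toIndexedValue() cannot propagate the checked exceptions declared by parse()
            throw new IllegalStateException("Could not extract text from " + documentPath, e);
        }
    }
}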
Then implement an annotation and its processor:
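import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import org.hibernate.search.mapper.pojo.extractor.mapping.annotation.ContainerExtraction;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMapping;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessor;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessorContext;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.processing.PropertyMappingAnnotationProcessorRef;
import org.hibernate.search.mapper.pojo.mapping.definition.programmatic.PropertyMappingStep;

@Retention(RetentionPolicy.RUNTIME)
@Target({ ElementType.METHOD, ElementType.FIELD })
@PropertyMapping(processor = @PropertyMappingAnnotationProcessorRef(
        type = TikaField.Processor.class
))
@Documented
@Repeatable(TikaField.List.class)
public @interface TikaField {
    // Name of the index field; empty means "use the property name"
    String name() default "";
    ContainerExtraction extraction() default @ContainerExtraction();
    @Documented
    @Target({ ElementType.METHOD, ElementType.FIELD })
    @Retention(RetentionPolicy.RUNTIME)
    @interface List {
        TikaField[] value();
    }
    // Translates each @TikaField annotation into a field mapping that uses TikaBridge
    class Processor implements PropertyMappingAnnotationProcessor<TikaField> {
        @Override
        public void process(PropertyMappingStep mapping, TikaField annotation,
                PropertyMappingAnnotationProcessorContext context) {
            TikaBridge bridge = new TikaBridge();
            mapping.genericField(annotation.name().isEmpty() ? null : annotation.name())
                    .valueBridge(bridge)
                    .extractors(context.toContainerExtractorPath(annotation.extraction()));
        }
    }
}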
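Then use it in your model:
public class MyEntity {
    // ...
    // Holds the path to the document; TikaBridge reads and parses that file at indexing time
    @TikaField
    String myDocument;
}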
If you need any parameters, you can add them to the annotation and pass them to the bridge's constructor.
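As a rough, library-free illustration of that data flow, the sketch below adds a hypothetical `writeLimit` attribute to a trimmed-down stand-in for the `TikaField` annotation and hands its value to a bridge constructor. `LimitedTextBridge`, `ParamSketch` and `bridgeFor` are made-up names for this example, not Hibernate Search or Tika API. In the real setup, `Processor.process` already receives the annotation instance, so it could presumably do the equivalent with `new TikaBridge(annotation.writeLimit())`, and the bridge could forward the limit to Tika's `WriteOutContentHandler(Writer, int)` constructor.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

class ParamSketch {
    // Plays the role of the annotation processor: reads the annotation
    // attribute and passes it to the bridge's constructor.
    static LimitedTextBridge bridgeFor(Class<?> type, String fieldName) {
        try {
            Field field = type.getDeclaredField(fieldName);
            TikaField annotation = field.getAnnotation(TikaField.class);
            if (annotation == null) {
                throw new IllegalArgumentException(fieldName + " is not annotated with @TikaField");
            }
            return new LimitedTextBridge(annotation.writeLimit());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No field " + fieldName + " on " + type, e);
        }
    }

    static class MyEntity {
        @TikaField(writeLimit = 5)
        String myDocument;
    }

    public static void main(String[] args) {
        LimitedTextBridge bridge = bridgeFor(MyEntity.class, "myDocument");
        System.out.println(bridge.toIndexedValue("Hello, world")); // prints "Hello"
    }
}

// Stand-in for the TikaField annotation, without the Hibernate Search
// meta-annotations; writeLimit is a hypothetical parameter.
@Retention(RetentionPolicy.RUNTIME)
@Target({ ElementType.METHOD, ElementType.FIELD })
@interface TikaField {
    String name() default "";
    int writeLimit() default -1; // -1 means "no limit"
}

// Stand-in for TikaBridge: receives the parameter through its constructor.
class LimitedTextBridge {
    private final int writeLimit;

    LimitedTextBridge(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    int writeLimit() {
        return writeLimit;
    }

    // Pretends the input is already-extracted text and caps its length.
    String toIndexedValue(String text) {
        if (text == null) {
            return null;
        }
        if (writeLimit < 0 || text.length() <= writeLimit) {
            return text;
        }
        return text.substring(0, writeLimit);
    }
}
```

The point of the design is that the bridge stays a plain class configured through its constructor, while the annotation is the only place where users state the parameter.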
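If you need to populate multiple fields from a single PDF/Word document, for example to index both the metadata and the document content, you will have to implement a PropertyBridge instead: it can populate multiple fields rather than just one. It is a bit more complex, but similar.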