StormCrawler: Apache Tika for parsing PDF properties

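I added Tika as a dependency of my StormCrawler implementation, and it can fetch PDF documents in the crawl. However, Title, Authors and other properties are not parsed. I have tried different combinations of 'index.md.mapping:' and added the corresponding properties to ES_IndexInit, but the content field in Kibana (index) for PDF documents is always empty. Everything works for HTML pages. Could you point out what I am missing, or point me to an example I can look at?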


es-crawler.flux:

name: "crawler"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 4
  - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

es-injector.flux:

name: "injector"

includes:
    - resource: true
      file: "/crawler-default.yaml"
      override: false

    - resource: false
      file: "crawler-conf.yaml"
      override: true

    - resource: false
      file: "es-conf.yaml"
      override: true

    - resource: false
      file: "injection-conf.yaml"
      override: true

components:
  - id: "scheme"
    className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
    constructorArgs:
      - DISCOVERED

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - ref: "scheme"

bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]

pom.xml:

http://maven.apache.org/maven-v4_0_0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>xyz.com</groupId>
    <artifactId>search</artifactId>
    <version>search1.0</version>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.3.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>exec</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <executable>java</executable>
                    <includeProjectDependencies>true</includeProjectDependencies>
                    <includePluginDependencies>false</includePluginDependencies>
                    <classpathScope>compile</classpathScope>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>1.3.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>org.apache.storm.flux.Flux</mainClass>
                                    <manifestEntries>
                                        <Change></Change>
                                        <Build-Date></Build-Date>
                                    </manifestEntries>
                                </transformer>
                            </transformers>
                            <!-- The filters below are necessary if you want to include the Tika module -->
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>1.1.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>flux-core</artifactId>
            <version>1.0.2</version>
        </dependency>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-core</artifactId>
            <version>1.7</version>
        </dependency>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-elasticsearch</artifactId>
            <version>1.7</version>
        </dependency>
        <dependency>
            <groupId>com.digitalpebble.stormcrawler</groupId>
            <artifactId>storm-crawler-tika</artifactId>
            <version>1.7</version>
        </dependency>
    </dependencies>

Your pom and flux files look fine. You could make the injection part of the main flux to keep things simple.

What is in crawler-conf.yaml? Did you prefix the field names with 'parse.'?

Here is the metadata extracted from the URL you posted above:

parse.dcterms:modified: 2004-09-29T20:21:18Z
parse.pdf:PDFVersion: 1.4
parse.access_permission:can_print: true
parse.pdf:docinfo:subject: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.pdf:docinfo:modified: 2004-09-29T20:21:18Z
parse.access_permission:extract_for_accessibility: true
parse.created: Fri Sep 24 15:56:30 BST 2004
parse.pdf:docinfo:created: 2004-09-24T14:56:30Z
parse.xmpTPg:NPages: 7
parse.access_permission:fill_in_form: true
parse.producer: Adobe PDF Library 6.0
parse.pdf:docinfo:title: About Metadata
parse.pdf:docinfo:producer: Adobe PDF Library 6.0
parse.dc:format: application/pdf; version=1.4
parse.access_permission:assemble_document: true
parse.access_permission:modify_annotations: true
parse.dc:title: About Metadata
parse.access_permission:can_print_degraded: true
parse.xmpMM:DocumentID: adobe:docid:indd:de7d50b0-0fc1-11d9-b0d4-cd42e793ca90
parse.xmpMM:DerivedFrom:DocumentID: adobe:docid:indd:a04d199f-0f11-11d9-b74d-bb0abf4f1ab0
parse.title: About Metadata
parse.Creation-Date: 2004-09-24T14:56:30Z
parse.modified: 2004-09-29T20:21:18Z
parse.resourceName: /digitalimag/pdfs/about_metadata.pdf
parse.dc:description: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.Last-Save-Date: 2004-09-29T20:21:18Z
parse.creator: Adobe Systems Incorporated
parse.pdf:encrypted: false
parse.trapped: False
parse.pdf:docinfo:creator: Adobe Systems Incorporated
parse.date: 2004-09-29T20:21:18Z
parse.meta:save-date: 2004-09-29T20:21:18Z
parse.Author: Adobe Systems Incorporated
parse.X-Parsed-By: org.apache.tika.parser.DefaultParser
parse.X-Parsed-By: org.apache.tika.parser.pdf.PDFParser
parse.pdf:docinfo:creator_tool: Adobe InDesign CS (3.0.1)
parse.dcterms:created: 2004-09-24T14:56:30Z
parse.access_permission:can_modify: true
parse.subject: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.meta:author: Adobe Systems Incorporated
parse.access_permission:extract_content: true
parse.xmp:CreatorTool: Adobe InDesign CS (3.0.1)
parse.dc:creator: Adobe Systems Incorporated
parse.cp:subject: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.pdf:docinfo:trapped: False
parse.meta:creation-date: 2004-09-24T14:56:30Z
parse.xmpMM:DerivedFrom:InstanceID: de7d50af-0fc1-11d9-b0d4-cd42e793ca90
parse.Last-Modified: 2004-09-29T20:21:18Z
parse.Content-Type: application/pdf
parse.description: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 

Your conf should contain something like:

  indexer.md.mapping:
  - parse.title=title
  - parse.Author=author
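
To illustrate why the 'parse.' prefix matters, here is a minimal sketch in plain Java. It assumes the mapping behaves like a simple source=target lookup against the metadata keys listed above. The class MdMappingSketch and its methods are hypothetical names made up for this illustration; this is not StormCrawler's own code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a "source=target" metadata mapping; not StormCrawler's own code. */
public class MdMappingSketch {

    /** Parses entries such as "parse.title=title" into source key -> index field. */
    static Map<String, String> parseMapping(List<String> entries) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : entries) {
            int eq = entry.indexOf('=');
            // Assumption for this sketch: an entry without '=' keeps its own name.
            String source = eq < 0 ? entry.trim() : entry.substring(0, eq).trim();
            String target = eq < 0 ? source : entry.substring(eq + 1).trim();
            mapping.put(source, target);
        }
        return mapping;
    }

    /** Copies only the mapped keys from the parsed metadata into the index document. */
    static Map<String, String> toIndexFields(Map<String, String> metadata,
                                             Map<String, String> mapping) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            String value = metadata.get(e.getKey());
            if (value != null) {
                doc.put(e.getValue(), value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // A few of the keys from the metadata dump above.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("parse.title", "About Metadata");
        metadata.put("parse.Author", "Adobe Systems Incorporated");
        metadata.put("parse.pdf:PDFVersion", "1.4");

        Map<String, String> withPrefix =
                parseMapping(Arrays.asList("parse.title=title", "parse.Author=author"));
        Map<String, String> withoutPrefix =
                parseMapping(Arrays.asList("title=title", "Author=author"));

        // With the prefix, both fields are found; without it, nothing matches.
        System.out.println(toIndexFields(metadata, withPrefix));
        System.out.println(toIndexFields(metadata, withoutPrefix));
    }
}
```

In this sketch a mapping written without the prefix yields an empty document, which would look like the empty fields described in the question.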

Judging from the test case code, you need to add the file under external/tika/src/test/resources/ and reference the file name in the test code, as with about_metadata.pdf in the example below:

@Test
public void testMetadata() throws IOException {

    bolt.prepare(new HashMap(), TestUtil.getMockedTopologyContext(),
            new OutputCollector(output));

    parse("https://www.adobe.com/digitalimag/pdfs/about_metadata.pdf",
            "about_metadata.pdf");

    List<List<Object>> outTuples = output.getEmitted();

    // single document
    Assert.assertEquals(1, outTuples.size());
    // metadata
    Metadata md = (Metadata) outTuples.get(0).get(2);
    Assert.assertTrue(
            md.getFirstValue("parse.pdf:docinfo:subject").contains(
                    "By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient."));

}

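Update

On closer inspection, the problem is in your flux. The redirection bolt sends tuples to Tika on a custom stream named 'tika'. The stream definition should therefore be: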

from: "redirection_bolt"
to: "parser_bolt"
grouping:
  type: LOCAL_OR_SHUFFLE
  streamId: "tika"
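
As a rough illustration of why the streamId matters, the following plain-Java toy model treats a stream definition as a subscription to one named stream of the source bolt. It does not use the Storm API; the class StreamRoutingSketch and its methods are hypothetical, and the only fact assumed about Storm is that a grouping without a streamId listens to the stream named "default".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of named streams; it does not use or reproduce the Storm API. */
public class StreamRoutingSketch {

    static final String DEFAULT_STREAM = "default";

    // Subscriptions keyed by "sourceBolt/streamId" -> names of receiving bolts.
    private final Map<String, List<String>> subscriptions = new HashMap<>();

    /** Registers a stream definition: 'to' listens to 'from' on the given stream. */
    void subscribe(String from, String to, String streamId) {
        subscriptions.computeIfAbsent(from + "/" + streamId, k -> new ArrayList<>()).add(to);
    }

    /** Returns the bolts that receive a tuple emitted by 'from' on 'streamId'. */
    List<String> receivers(String from, String streamId) {
        return subscriptions.getOrDefault(from + "/" + streamId, new ArrayList<>());
    }

    public static void main(String[] args) {
        // Definition without a streamId: the parser listens to the default stream only.
        StreamRoutingSketch withoutStreamId = new StreamRoutingSketch();
        withoutStreamId.subscribe("redirection_bolt", "parser_bolt", DEFAULT_STREAM);

        // Definition with streamId "tika", as in the snippet above.
        StreamRoutingSketch withStreamId = new StreamRoutingSketch();
        withStreamId.subscribe("redirection_bolt", "parser_bolt", "tika");

        // Tuples for non-HTML documents are emitted on the "tika" stream.
        System.out.println(withoutStreamId.receivers("redirection_bolt", "tika"));
        System.out.println(withStreamId.receivers("redirection_bolt", "tika"));
    }
}
```

In this model the parser bolt receives the tuple only when it subscribes to the "tika" stream, which matches the symptom of PDFs being fetched but never parsed.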