Crawl images and their metadata using Nutch and index them into Solr

Question

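I want to build a small image-based search engine: I give it an image file and it searches Solr for similar images. I am using Nutch for the crawling part and indexing the data into Solr. I have made the following changes to the Nutch conf files:

Added image/* to mimetype-filter.txt
Removed the image extensions from suffix-urlfilter.txt so that they are not skipped

I have also added these fields to the Solr schema.xml: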

<field name="name" type="string" indexed="true" stored="true" />
<field name="iso" type="string" indexed="true" stored="true" multiValued="true" />
<field name="iso_string" type="string" indexed="true" stored="true" multiValued="true" />
<field name="aperture" type="double" indexed="true" stored="true" />
<field name="exposure" type="string" indexed="true" stored="true" />
<field name="exposure_time" type="double" indexed="true" stored="true" />
<field name="focal" type="string" indexed="true" stored="true" />
<field name="focal_35" type="string" indexed="true" stored="true" />
<dynamicField name="ignored_*" type="string" indexed="false" stored="false" multiValued="true" />

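But when I crawl, no data is indexed into Solr. I cannot find any documentation or tutorial on this. I have also gone through some Stack Overflow posts about crawling images with Nutch, but I did not find them helpful.

Can someone point me in the right direction? Thanks in advance.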

Answer 1

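There is no easy or short answer to this question; parsing images is tricky even without the crawling part. On top of what you have already done, you first need to enable the parse-tika plugin (parse-html only handles HTML documents). Apache Tika is able to extract some metadata about images.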

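You also need to enable the mimetype-filter plugin (editing its configuration file is not enough; the plugin itself has to be enabled in nutch-site.xml). Once that is configured, run the bin/nutch parsechecker <URL> tool against a URL that contains some images and check whether the image URLs show up in the Outlinks section. Then run parsechecker against an image URL directly to see which metadata it extracts. After that, run the bin/nutch indexchecker tool against both URLs, check which fields it would index into Solr, and create them in your schema accordingly. Keep in mind that Tika may extract different metadata for each format.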
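
As a rough sketch of the two plugin changes described in this answer, the plugin.includes property in nutch-site.xml could look something like the following. Only parse-tika and mimetype-filter come from the answer itself; the remaining plugin names are an assumption based on a typical Nutch 1.x setup with Solr indexing and the suffix URL filter mentioned in the question, so keep whatever your installation already lists and just add the two entries.

```xml
<!-- nutch-site.xml (sketch, not a drop-in file) -->
<!-- Assumption: the plugin list below mirrors a typical Nutch 1.x default. -->
<!-- Only parse-tika and mimetype-filter are the additions this answer calls for. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-(html|tika): lets Tika handle non-HTML content such as images -->
    <!-- mimetype-filter: activates the filter that reads mimetype-filter.txt -->
    <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika)|index-(basic|anchor)|mimetype-filter|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

After changing this value, rerunning parsechecker and indexchecker as described above should show whether the image URLs are now parsed and which fields would be sent to Solr.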

apache

solr

image

web-crawler

nutch