Setting up StormCrawler and Elasticsearch to crawl our website's HTML files and PDF documents

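We are using StormCrawler and Elasticsearch to crawl our website. We followed the documentation for using Elasticsearch with StormCrawler. When we search in Kibana, we do get results for HTML files, but not PDF file content or links. How do we set up StormCrawler to crawl our website's HTML and PDF files and store their content in Elasticsearch? What configuration changes do we need to make? Does this have something to do with the outlinks settings? Is there documentation that explains how to set up StormCrawler and Elasticsearch to crawl HTML and PDF documents?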

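You are probably looking at the 'content' index in Kibana, but you should also look at the 'status' index, which should contain the PDF documents. A quick look at the logs would also tell you that the PDFs are being fetched but that the parser is skipping them. The status index contains an ERROR status and a message mentioning 'content-type checking'.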

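So, how do you fix it? Simply add the Tika module as a Maven dependency and follow the steps in its README, so that PDF documents are redirected to the Tika Parsing Bolt, which can extract text and metadata from them. They should then be indexed correctly into the 'content' index.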
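As a rough sketch of what those README steps amount to, assuming a Flux-based topology like the one generated by the StormCrawler Elasticsearch archetype: the bolt ids ("parse", "shunt", "tika", "index", "status") and the file names below are illustrative assumptions, and the class names assume the `com.digitalpebble.stormcrawler` releases (newer Apache releases use a different package prefix). Check the README of the version you are running before copying any of this.

```yaml
# --- crawler-conf.yaml (fragment, file name assumed) ---
config:
  # Let the JSoup parser pass non-HTML documents on instead of marking them as ERROR
  jsoup.treat.non.html.as.error: false
  # Restrict what Tika will parse; here only PDFs (extend the list as needed)
  parser.mimetype.whitelist:
    - application/.*pdf.*

# --- es-crawler.flux (fragment, file name and bolt ids assumed) ---
bolts:
  - id: "shunt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "tika"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  # Replaces the direct "parse" -> "index" stream of the original topology
  - from: "parse"
    to: "shunt"
    grouping:
      type: LOCAL_OR_SHUFFLE
  # Documents the JSoup parser could not handle go to Tika
  - from: "shunt"
    to: "tika"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"
  # HTML documents go straight to the indexer
  - from: "shunt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  # Documents parsed by Tika are indexed as well
  - from: "tika"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  # Tika reports its outcome to the status updater, like the other bolts
  - from: "tika"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
```

After redeploying, the PDF URLs that previously ended up with an ERROR status need to be fetched again before their content appears in the 'content' index.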

Tags: html, pdf, elasticsearch, stormcrawler