StormCrawler's archetype topology does not fetch outlinks
As far as I understand, the basic example should be able to crawl and fetch pages.
I followed the example at http://stormcrawler.net/getting-started/, but the crawler only seems to fetch a couple of pages and then does nothing else.
I wanted to crawl http://books.toscrape.com/ and ran the crawl, but in the log I can see that only the first page is fetched; some others are discovered but never fetched:
8010 [Thread-34-parse-executor[5 5]] INFO c.d.s.b.JSoupParserBolt - Parsing : starting http://books.toscrape.com/
8214 [Thread-34-parse-executor[5 5]] INFO c.d.s.b.JSoupParserBolt - Parsed http://books.toscrape.com/ in 182 msec
content 1435 chars
url http://books.toscrape.com/
domain toscrape.com
description
title All products | Books to Scrape - Sandbox
http://books.toscrape.com/catalogue/category/books/new-adult_20/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/thriller_37/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/academic_40/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/classics_6/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/paranormal_24/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
....
17131 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 6:partitioner URLPartitioner {}
17164 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 8:spout queue_size 0
17403 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 5:parse JSoupParserBolt {tuple_success=1, outlink_kept=73}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher num_queues 0
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher fetcher_average_perdoc {time_in_queues=265.0, bytes_fetched=51294.0, fetch_time=52.0}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher fetcher_counter {robots.fetched=1, bytes_fetched=51294, fetched=1}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher activethreads 0
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher fetcher_average_persec {bytes_fetched_perSec=5295.137813564571, fetched_perSec=0.10323113451016827}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher in_queues 0
27127 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 6:partitioner URLPartitioner {}
27168 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 8:spout queue_size 0
27405 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 5:parse JSoupParserBolt {tuple_success=0, outlink_kept=0}
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher num_queues 0
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher fetcher_average_perdoc {}
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher fetcher_counter {robots.fetched=0, bytes_fetched=0, fetched=0}
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher activethreads 0
27696 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher fetcher_average_persec {bytes_fetched_perSec=0.0, fetched_perSec=0.0}
I did not change the configuration files, including crawler-conf.yaml. Also, the flag parser.emitOutlinks should be true, since that is the default set in crawler-default.yaml.
In another project I also followed the YouTube tutorial on Elasticsearch, and there I ran into the same problem that no pages were fetched and indexed at all.
Where could the error be that causes the crawler not to fetch any pages?
The topology generated by the archetype is just an example: it uses StdOutStatusUpdater, which simply dumps the discovered URLs to the console. If you run in local mode or with a single worker, you can use MemoryStatusUpdater instead, as it adds the discovered URLs to the MemorySpout, which will then process them in turn.
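A minimal sketch of that swap in the archetype-generated topology class (package and class names follow the StormCrawler 1.x conventions; the seed URL and the exact set of bolts feeding the status stream are assumptions here, so adjust them to match your generated CrawlTopology):

import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // MemorySpout keeps the URL queue in the worker's memory;
        // seed it with the start URL (assumed here for illustration)
        builder.setSpout("spout",
                new MemorySpout("http://books.toscrape.com/"));

        // ... fetch / sitemap / parse / index bolts as generated by the archetype ...

        // Replace StdOutStatusUpdater (which only prints DISCOVERED URLs)
        // with MemoryStatusUpdater, which pushes them back into the
        // MemorySpout so they actually get fetched on the next cycle.
        builder.setBolt("status", new MemoryStatusUpdater())
                .localOrShuffleGrouping("fetch", Constants.StatusStreamName)
                .localOrShuffleGrouping("parse", Constants.StatusStreamName);
    }
}

Since MemoryStatusUpdater and MemorySpout share state within a single JVM, this wiring only makes sense in local mode or with a single worker, which is exactly the limitation described above.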
Note that this will not persist any information about the URLs when you kill the topology or it crashes; again, it is meant only for debugging and for your first steps with StormCrawler.
If you want the URLs to be persisted, you can use any of the persistence backends (SOLR/Elasticsearch, SQL). Feel free to describe your ES problem in a separate question.