Speed up the crawling process

Question

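I am using ES 6.5.x and StormCrawler 1.10. How can I speed up the rate at which the crawler fetches records? When I check the metrics, they show an average of 0.4 pages per second. Is there anything I need to change in the crawler configuration below?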

Crawler configuration file:

config: 
  topology.workers: 2
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.server.delay: .25
  fetcher.threads.number: 200
  fetcher.threads.per.queue: 5

  worker.heap.memory.mb: 2048

  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  http.content.limit: -1
  fetchInterval.default: 1440
  fetchInterval.fetch.error: 120
  fetchInterval.error: -1
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1

Answer 1

If you are crawling a single site, then you don't need 2 workers or more than one ES shard and spout. All the URLs would be directed to a single shard anyway.

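You are using 5 threads per queue, but you retrieve only 2 URLs per bucket from ES (es.status.max.urls.per.bucket: 2) and force a gap of 2 seconds between calls to ES (spout.min.delay.queries: 2000), so the spout can't produce more than 1 URL per second on average (2 URLs every 2 seconds). In addition, the refresh_interval in ES_IndexInit.sh affects how quickly changes become visible in the index, and therefore the likelihood that a query returns fresh URLs.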

Simply change es.status.max.urls.per.bucket to a larger value, e.g. 10, and lower spout.min.delay.queries to the same value as the refresh_interval in ES_IndexInit.sh, e.g. 1 second. This will give you a lot more URLs.
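
As a rough sketch, the two changes could look like the fragment below. It assumes these keys sit under the same top-level config: block as the crawler configuration in the question (in a typical StormCrawler Elasticsearch setup they may instead live in a separate ES configuration file), and that refresh_interval in ES_IndexInit.sh has been set to 1 second. The values are the "e.g." suggestions above, not tuned recommendations.

```yaml
# Sketch only: values follow the "e.g." suggestions in the answer.
config:
  # Retrieve more URLs per bucket on each query to ES
  # (the value currently in use is 2).
  es.status.max.urls.per.bucket: 10

  # Minimum delay between queries to ES, in milliseconds.
  # 1000 ms is meant to match a refresh_interval of 1 second
  # in ES_IndexInit.sh (assumed value; keep the two in sync).
  spout.min.delay.queries: 1000
```

With these values the spout's ceiling for a single bucket rises from 2 URLs every 2 seconds (1 URL per second) to 10 URLs every second. The fetcher's own settings, such as fetcher.server.delay and fetcher.threads.per.queue, still limit how fast those URLs are actually fetched.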

web-crawler

stormcrawler