How to use a MessageQueue in a web crawler?

A MessageQueue looks like it should be a good architectural solution for building a web crawler, but I still can't work out how to actually do it.

Let's consider the first case, with a shared database. Here it is pretty clear what to do: the algorithm is a classic graph traversal (see the code sketch after the list):

There are multiple Workers and a shared database.

- I manually put the first url into the database

while true

  - worker gets a random discovered url from the database.
  - worker parses it and gets the list of all links on the page.
  - worker updates the url in the database as processed.
  - worker looks up each found link in the database and separates
    them into processed, discovered, and new ones.
  - worker adds the new links to the database as discovered.
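
For concreteness, here is a minimal sketch of that loop in Python. The in-memory `db` dict stands in for the shared database, and the standard-library `HTMLParser` stands in for a real link extractor; both are illustrative choices, not part of the question:

    import random
    import urllib.request
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    DISCOVERED, PROCESSED = "discovered", "processed"
    db = {"http://example.com/": DISCOVERED}  # seed url (hypothetical)

    def fetch_links(url):
        # A real crawler would resolve relative links with urllib.parse.urljoin
        # and handle errors; this is just the happy path.
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        parser = LinkParser()
        parser.feed(html)
        return parser.links

    def worker_step():
        discovered = [u for u, s in db.items() if s == DISCOVERED]
        if not discovered:
            return False                    # nothing left to crawl
        url = random.choice(discovered)     # get a random discovered url
        links = fetch_links(url)            # parse it, collect its links
        db[url] = PROCESSED                 # mark it processed
        for link in links:
            if link not in db:              # DB lookup separates new from known
                db[link] = DISCOVERED
        return True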

Now let's consider the second case, using a MessageQueue:

There is a MessageQueue containing the urls that should be processed,
and multiple Workers.

- I manually put the first url in the Queue.

while true

  - worker takes the next discovered url from the Queue.
  - worker parses it and gets the list of all links on the page.
  - what does it do next? How does it separate the found links into
    processed, discovered, and new ones?
  - worker puts the list of new urls into the Queue as discovered.

Questions:

What does it do next? How does it separate the found links into processed, discovered, and new ones?

You could have separate queues for these, flowing back into your database. The idea is that you can have multiple workers, with a feedback loop that sends newly discovered URLs back onto the queue for processing and then on to the database for storage.
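
A rough sketch of that topology, with Python's `queue.Queue` standing in for a real message broker and `fetch_links` reused from the sketch above; `save_to_database` and the queue names are hypothetical placeholders:

    import queue
    import threading

    crawl_queue = queue.Queue()   # urls waiting to be fetched
    store_queue = queue.Queue()   # results flowing back toward the database

    def save_to_database(url, links):
        pass  # placeholder: a real implementation would INSERT into the DB

    def crawl_worker():
        while True:
            url = crawl_queue.get()
            links = fetch_links(url)        # fetch and parse the page
            store_queue.put((url, links))   # hand results to the storage side
            for link in links:
                crawl_queue.put(link)       # feedback loop: new urls re-enter
            crawl_queue.task_done()         # note: no dedup yet; see below

    def store_worker():
        while True:
            url, links = store_queue.get()
            save_to_database(url, links)
            store_queue.task_done()

    for _ in range(4):                      # multiple crawl workers
        threading.Thread(target=crawl_worker, daemon=True).start()
    threading.Thread(target=store_worker, daemon=True).start()
    crawl_queue.put("http://example.com/")  # seed the loop
    crawl_queue.join()                      # blocks; without dedup the feedback
                                            # loop keeps feeding itself (see below)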

How do you separate the links found on the page into processed, discovered, and new ones? It's clear how to do it in the DB case - just look each link up in the DB - but how do you do it in the MessageQueue case?

You would probably still look up links coming off the queue in the database.

So the workflow would look like this:

- A link is dropped onto the queue.
- A queue worker picks it up and checks the database to see whether the link has already been processed.
- If it hasn't, the worker calls the website to retrieve the outbound links.
- It parses the page and puts each outbound link onto the queue for processing.
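
In code, the per-message logic might look something like this sketch, where `is_processed`/`mark_processed` are hypothetical database helpers (backed here by an in-memory set) and `crawl_queue`/`fetch_links` come from the sketches above:

    processed = set()  # stand-in for a table in the database

    def is_processed(url):
        return url in processed

    def mark_processed(url, links):
        processed.add(url)  # a real version would also persist the links

    def queue_worker():
        while True:
            url = crawl_queue.get()
            if not is_processed(url):        # check the database first
                links = fetch_links(url)     # call the site, parse the page
                mark_processed(url, links)   # record it as processed
                for link in links:
                    crawl_queue.put(link)    # outbound links go back on the queue
            crawl_queue.task_done()          # the message is consumed either way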

Is it OK to keep all discovered urls in the MessageQueue? What if there are thousands of sites with thousands of pages each - there would be millions of messages waiting in the Queue.

Probably not, and that's what the database is for. Once things have been processed, you should remove them from the queue. Queues are for... queueing. Message transport. Not for data storage. Databases are for storing data.

Now, until they are processed, yes, you can leave them in the queue. If you're worried about queue capacity, you could modify the workflow so that the queue workers drop any already-processed links, which should reduce the depth of the queue. It might even be more efficient.