爬虫设计——调用异步作业与调用服务

crawler design - calling an async job vs. calling a service

我正在查看 donne martin's design for a web crawler。 爬虫服务处理一个新爬取的url,然后:

  • Adds a job to the Reverse Index Service queue to generate a reverse index
  • Adds a job to the Document Service queue to generate a static title and snippet

如果爬虫服务同步调用这两个服务,会发生什么情况?我仍然可以根据每个服务的负载水平扩展所有 3 个服务,对吗?我想到的一个可能原因是,如果其中一个失败,则流量控制会更复杂。这些异步作业还有其他更有说服力的理由吗?

这种设计选择背后可能有更多原因,但几乎可以肯定的是 Microservices 的使用。这是一种流行的技术,因此演示如何掌握它是回答设计问题的好主意,它的好处在维基百科上有详细描述:

  • Modularity: This makes the application easier to understand, develop, test, and become more resilient to architecture erosion.[6] This benefit is often argued in comparison to the complexity of monolithic architectures.[33]
  • Scalability: Since microservices are implemented and deployed independently of each other, i.e. they run within independent processes, they can be monitored and scaled independently.[34]
  • Integration of heterogeneous and legacy systems: microservices is considered as a viable mean for modernizing existing monolithic software application.[35][36] There are experience reports of several companies who have successfully replaced (parts of) their existing software by microservices, or are in the process of doing so.[37] The process for Software modernization of legacy applications is done using an incremental approach.[38]
  • Distributed development: it parallelizes development by enabling small autonomous teams to develop, deploy and scale their respective services independently.[39] It also allows the architecture of an individual service to emerge through continuous refactoring.[40] Microservice-based architectures facilitate continuous integration, continuous delivery and deployment.[41] [42]

所有这些都适用于这种情况。事实上,定义明确的 API 使模块分离、可重用且易于理解。很可能这 3 个模块中的每一个都有非常不同的执行时间和 CPU/memory 要求,因此分别扩展它们很有意义。页面上提到的亚马逊等一些公司可能会根据团队编号进一步将这些模块拆分为微服务,因此可以根据拥有 3 个团队的假设而不是技术限制来很好地选择拆分为 3 个服务。

该页面还描述了对该技术的批评。

what would happen if instead the crawler service would synchronously call these 2 services?

第一点——那么最慢的服务就会成为爬虫的瓶颈。同步调用是指爬虫需要等待请求被服务处理。在队列的情况下,爬虫将工作得更快,处理新的 links 而不是等待其他服务。我们可以假设爬虫可以有自己的内部队列。

第二点——耐用性。如果任何服务出现故障并且无法处理来自爬虫的请求,一个 link 或多个丢失可能并不那么重要。但是队列可以是持久的,在磁盘上保存状态,在它停止的地方恢复它的工作。如果所有服务同时关闭并且许多 link 将丢失,这将非常有用。

what came to me as a possible reason is just more complex flow control if one of them fails

这种方法不灵活。通常,您应该能够根据需要轻松添加任意数量的新服务以扩展工作负载,而无需更改任何代码。因此,“流量控制”不应作为每次添加或删除服务实例时都需要修改的代码存在。在可以向上和向下扩展的实际应用程序中,所有这些事情都是自动完成的,无需重新部署应用程序。