kappa 架构和 lambda 架构之间有什么区别

What are the differences between kappa-architecture and lambda-architecture

如果Kappa-Architecture直接对流进行分析而不是将数据分成两个流,那么在像Kafka这样的消息系统中,数据存储在哪里?还是可以在数据库中重新计算?

单独的批处理层是否比使用流处理引擎重新计算批处理分析更快?

"A very simple case to consider is when the algorithms applied to the real-time data and to the historical data are identical. Then it is clearly very beneficial to use the same code base to process historical and real-time data, and therefore to implement the use-case using the Kappa architecture". "Now, the algorithms used to process historical data and real-time data are not always identical. In some cases, the batch algorithm can be optimized thanks to the fact that it has access to the complete historical dataset, and then outperform the implementation of the real-time algorithm. Here, choosing between Lambda and Kappa becomes a choice between favoring batch execution performance over code base simplicity". "Finally, there are even more complex use-cases, in which even the outputs of the real-time and batch algorithm are different. For example, a machine learning application where generation of the batch model requires so much time and resources that the best result achievable in real-time is computing and approximated updates of that model. In such cases, the batch and real-time layers cannot be merged, and the Lambda architecture must be used".

Quote

  • 单独的批处理和流层
  • 更高的代码复杂度
  • 使用单独的 batch/stream
  • 性能更快
  • 更好地处理批处理和流处理中的不同算法
  • 批量计算的数据存储比数据库更便宜

  • 只有一个蒸汽处理层
  • 更易于维护,复杂度更低,批量和单一算法 流
  • 如果从数据库重新计算批处理数据太多,成本会很高
  • 如果从数据库或 kafka 重新计算批处理数据太多,处理速度会变慢

您可能还想阅读讨论这两者的原始文章here

引用原博文post

"The efficiency and resource trade-offs between the two approaches are somewhat of a wash. The Lambda Architecture requires running both reprocessing and live processing all the time, whereas what I have proposed only requires running the second copy of the job when you need reprocessing. However, my proposal requires temporarily having 2x the storage space in the output database and requires a database that supports high-volume writes for the re-load. In both cases, the extra load of the reprocessing would likely average out. If you had many such jobs, they wouldn’t all reprocess at once, so on a shared cluster with several dozen such jobs you might budget an extra few percent of capacity for the few jobs that would be actively reprocessing at any given time.

The real advantage isn’t about efficiency at all, but rather about allowing people to develop, test, debug, and operate their systems on top of a single processing framework. So, in cases where simplicity is important, consider this approach as an alternative to the Lambda Architecture."