What is the principle of "code moving to data" rather than data to code?

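In recent discussions about distributed processing and streaming, I came across the concept of "code moving to data". Can someone help explain it? The reference for this phrase is MapReduceWay.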

For Hadoop, it's stated in a question, but I still could not find an explanation of the principle in a technology-agnostic way.

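The basic idea is simple: if the code and the data are on different machines, one of them must be moved to the other machine before the code can be executed on the data. If the code is smaller than the data, it is better to send the code to the machine holding the data than the other way around, provided all machines are equally fast and code-compatible. [Arguably you can send the source code and JIT compile it as needed.]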

In the world of Big Data, the code is almost always smaller than the data.
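To make the size argument concrete, here is a minimal back-of-the-envelope sketch. The sizes (10 MB of code, 1 TB of data) and the 1 Gbit/s link are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Time to push `size_bytes` over a link, ignoring latency and protocol overhead."""
    return size_bytes / bandwidth_bytes_per_s


def cheaper_to_move(code_bytes: int, data_bytes: int) -> str:
    """Whichever is smaller is cheaper to ship, assuming either machine could run the job."""
    return "code" if code_bytes < data_bytes else "data"


# Assumed, illustrative numbers.
CODE_BYTES = 10_000_000          # 10 MB application
DATA_BYTES = 1_000_000_000_000   # 1 TB data set
LINK = 125_000_000               # 1 Gbit/s expressed in bytes per second

code_time = transfer_seconds(CODE_BYTES, LINK)
data_time = transfer_seconds(DATA_BYTES, LINK)
choice = cheaper_to_move(CODE_BYTES, DATA_BYTES)
```

Under these assumptions, shipping the code takes a fraction of a second while shipping the data takes hours, which is the whole argument in one comparison.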

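On many supercomputers, the data is partitioned across many nodes, and all the code for the entire application is replicated on every node, precisely because the entire application is small compared to even the locally stored data. Any node can then run the part of the program that applies to the data it holds. There is no need to send code on demand.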
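The partitioned-data, replicated-code arrangement can be sketched in a single process. This is only a simulation under assumed names: the `partitions` dictionary stands in for data that already lives on each node, and `run_everywhere` stands in for every node running its own copy of the same program.

```python
def count_errors(lines):
    """The 'application': identical on every node, and tiny compared to the data."""
    return sum(1 for line in lines if "ERROR" in line)


# Hypothetical local data held by each node; it never leaves the node.
partitions = {
    "node-1": ["ok", "ERROR disk", "ok"],
    "node-2": ["ERROR net", "ERROR cpu"],
    "node-3": ["ok"],
}


def run_everywhere(program, partitions):
    """Each node applies the same program to its own partition only."""
    return {node: program(local_data) for node, local_data in partitions.items()}


# Only the small per-node results cross the network to be combined.
partial_counts = run_everywhere(count_errors, partitions)
total_errors = sum(partial_counts.values())
```

What travels between machines here is one integer per node, not the log lines themselves.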

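I also just came across the phrase "Moving Computation is Cheaper than Moving Data" (from the Apache Hadoop documentation), and after some reading I think it refers to the principle of data locality.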

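Data locality is a task-scheduling strategy that aims to optimize performance, based on the observation that moving data across the network is costly. When a compute/data node becomes free and the scheduler chooses which task to run next, preference is given to tasks that will operate on data held on the free node or close to it.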

This passage (from Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Zaharia et al., 2010) explains it clearly:

Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack).
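The greedy choice described in the quote (same node, else same rack, else remote rack) can be sketched as follows. This is not Hadoop's scheduler code; `Node`, `Task`, `distance`, and `pick_task` are hypothetical names used only to illustrate the ordering.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    rack: str


@dataclass
class Task:
    task_id: str
    replicas: List[Node]  # nodes that hold a copy of this task's input


def distance(task: Task, free_node: Node) -> int:
    """0 = node-local, 1 = rack-local, 2 = remote rack."""
    if any(n.name == free_node.name for n in task.replicas):
        return 0
    if any(n.rack == free_node.rack for n in task.replicas):
        return 1
    return 2


def pick_task(pending: List[Task], free_node: Node) -> Optional[Task]:
    """Greedily pick the pending task whose input is closest to the free node."""
    if not pending:
        return None
    return min(pending, key=lambda t: distance(t, free_node))


a1, a2, b1 = Node("a1", "rack-A"), Node("a2", "rack-A"), Node("b1", "rack-B")
pending = [Task("t1", [b1]), Task("t2", [a2]), Task("t3", [a1])]
first = pick_task(pending, a1)                                  # node-local task
second = pick_task([t for t in pending if t is not first], a1)  # rack-local task
```

When node `a1` reports a free slot, the sketch prefers the task whose input is already on `a1`, then the one on the same rack, and only then the one on another rack.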

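Note that the fact that Hadoop replicates data across nodes also helps tasks get scheduled fairly: the higher the replication factor, the higher the probability that a task has its data on the next free node and is therefore picked to run next.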
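The effect of replication on that probability can be estimated with a rough model. It assumes each block's replicas sit on distinct nodes chosen uniformly at random and that tasks are independent; real placement policies are rack-aware, so treat this as an illustration only. The function name is hypothetical.

```python
def p_local_task(replication: int, nodes: int, pending_tasks: int) -> float:
    """Probability that at least one pending task has a replica on a given free node.

    Simplified model: replicas are on distinct, uniformly random nodes,
    and the pending tasks' placements are independent of each other.
    """
    p_block_here = replication / nodes
    return 1.0 - (1.0 - p_block_here) ** pending_tasks


# Assumed, illustrative cluster: 100 nodes, 10 pending tasks.
single_copy = p_local_task(replication=1, nodes=100, pending_tasks=10)
three_copies = p_local_task(replication=3, nodes=100, pending_tasks=10)
```

In this model, raising the replication factor from 1 to 3 raises the chance that the free node can run a task on local data.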

architecture

hadoop

mapreduce

distributed-computing

design-principles