What is YARN (Hadoop) - very simple example
I'm new to Hadoop and trying to understand it. I found a nice simple example
explaining HDFS and MapReduce (see below), but I can't google any similarly
simple example for YARN. Could someone explain it (in layman's terms)?
Think of a file that contains the phone numbers for everyone in the United
States; the people with a last name starting with A might be stored on server
1, B on server 2, and so on.
In a Hadoop world, pieces of this phonebook would be stored across the cluster,
and to reconstruct the entire phonebook, your program would need the blocks
from every server in the cluster. To achieve availability as components fail,
HDFS replicates these smaller pieces onto two additional servers by default.
(This redundancy can be increased or decreased on a per-file basis or for a
whole environment; for example, a development Hadoop cluster typically doesn’t
need any data redundancy.) This redundancy offers multiple benefits, the most
obvious being higher availability.
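For example, the cluster-wide default replication factor lives in hdfs-site.xml, and it can be overridden per file from the shell. A minimal sketch (the property and command are standard; the file path below is hypothetical):

```xml
<!-- hdfs-site.xml: default number of copies HDFS keeps of each block -->
<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- stock default: one copy plus two replicas -->
</property>
```

```sh
# Drop a single file to one copy, e.g. on a dev cluster (path hypothetical)
hdfs dfs -setrep -w 1 /data/phonebook/part-0000
```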
In addition, this redundancy allows the Hadoop cluster to break work up into
smaller chunks and run those jobs on all the servers in the cluster for better
scalability. Finally, you get the benefit of data locality, which is critical
when working with large data sets. We detail these important benefits later in
this chapter.
Let’s look at a simple example. Assume you have five files, and each file
contains two columns (a key and a value in Hadoop terms) that represent a city
and the corresponding temperature recorded in that city for the various
measurement days. Of course we’ve made this example very simple so it’s easy to
follow. You can imagine that a real application won’t be quite so simple, as
it’s likely to contain millions or even billions of rows, and they might not be
neatly formatted rows at all; in fact, no matter how big or small the amount of
data you need to analyze, the key principles we’re covering here remain the
same. Either way, in this example, city is the key and temperature is the
value.
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature
for each city across all of the data files (note that each file might have the
same city represented multiple times). Using the MapReduce framework, we can
break this down into five map tasks, where each mapper works on one of the five
files and the mapper task goes through the data and returns the maximum
temperature for each city. For example, the results produced from one mapper
task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not
shown here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which
combine the input results and output a single value for each city, producing a
final result set as follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
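To make this concrete, here is a minimal sketch of this max-temperature job using the Hadoop Java MapReduce API. The class names and the comma-separated input format are assumptions made to match the sample data above, not part of the original example:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  // Map: each input line looks like "Toronto, 20"; emit (city, temperature).
  public static class TemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",");
      if (parts.length == 2) {
        context.write(new Text(parts[0].trim()),
                      new IntWritable(Integer.parseInt(parts[1].trim())));
      }
    }
  }

  // Reduce: for one city, scan all of its temperatures and keep the maximum.
  public static class MaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text city, Iterable<IntWritable> temps,
        Context context) throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable t : temps) {
        max = Math.max(max, t.get());
      }
      context.write(city, new IntWritable(max));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperature.class);
    job.setMapperClass(TemperatureMapper.class);
    // The combiner runs MaxReducer on each mapper's local output; this is
    // what produces the per-file (city, max) pairs described above.
    job.setCombinerClass(MaxReducer.class);
    job.setReducerClass(MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note one design choice: the narrative says each mapper "returns the maximum temperature for each city", and in this sketch that local pre-aggregation is done by registering the reducer as a combiner, since taking a maximum is associative and safe to apply per file before the shuffle.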
As an analogy, you can think of map and reduce tasks as the way a census was
conducted in Roman times, where the census bureau would dispatch its people to
each city in the empire. Each census taker in each city would be tasked to
count the number of people in that city and then return their results to the
capital city.
There, the results from each city would be reduced to a single count (sum of
all cities) to determine the overall population of the empire. This mapping of
people to cities, in parallel, and then combining the results (reducing) is
much more efficient than sending a single person to count every person in the
empire in a serial fashion.
Suppose you have 4 machines, each with 4 GB of RAM and a dual-core CPU.
You can present YARN to an application that is able to distribute and parallelize its workload, such as MapReduce, and YARN will respond that it can accept 16 GB of application workload across 8 CPU cores.
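In practice, each NodeManager advertises what it offers through yarn-site.xml. A minimal sketch for one of these 4 GB, dual-core machines (the properties are standard YARN settings; the values are illustrative):

```xml
<!-- yarn-site.xml on each worker node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value> <!-- 4 GB of RAM offered to YARN containers -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value> <!-- 2 CPU cores offered to YARN containers -->
</property>
```

With four such nodes registered, the ResourceManager's aggregate pool is the 16 GB and 8 cores mentioned above.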
Not all of the nodes need to be identical; some can have GPU resources or higher memory throughput, but for any single running application you will always be limited by the smallest node in the group... The framework decides which node your code gets deployed to based on the available resources, not you. When the NodeManagers are co-located with HDFS DataNodes (they run on the same machines), code that tries to read a file will, where possible, be run on a machine that holds the parts of the file it needs.
Basically: divide your storage into small blocks (HDFS), provide a way to read those blocks back into complete files (MapReduce), and use some processing engine to distribute that operation fairly, or greedily, across a pool of resources (YARN's FairScheduler or CapacityScheduler).
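The choice between those two policies is itself just ResourceManager configuration. For example, to select the FairScheduler (the CapacityScheduler class in the sibling scheduler.capacity package is the stock default):

```xml
<!-- yarn-site.xml on the ResourceManager: pick the scheduling policy -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```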