MapReduce job (written in Python) runs slowly on EMR
I am trying to write a MapReduce job using Python's mrjob package. The job processes ~36,000 files stored in S3, each about 2MB. When I run the job locally (after downloading the S3 bucket to my machine), it takes roughly 1 hour. When I try to run it on EMR, however, it takes much longer (I killed it at the 8-hour mark, when the mappers were only 10% complete). I have attached my mapper_init and mapper code below. Does anyone know what could cause a problem like this, and how to fix it? I should also note that the job works fine when I limit the input to a sample of 100 files.
from mrjob.job import MRJob  # class context; the snippet below omits the surrounding boilerplate
from nltk.corpus import stopwords
import re


class MRIngredientWords(MRJob):  # illustrative class name

    def mapper_init(self):
        """
        Set instance variables that will be useful to our mapper:

        filename: the path and filename of the current recipe file
        previous_line: the line previously parsed. We need this because the
            ingredient name is in the line after the tag.
        """
        # self.filename = os.environ["map_input_file"]  # Not currently used
        self.previous_line = "None yet"

        # Determining if an item is in a list is O(n) while determining if an
        # item is in a set is O(1), so keep the stop words in a set.
        self.stopwords = set(stopwords.words('english'))
        # Merge in the custom stop word list (stopwords_list is assumed to be
        # defined elsewhere on the class).
        self.stopwords |= set(self.stopwords_list)

    def mapper(self, _, line):
        """
        Takes a line from an HTML file and yields ingredient words from it.

        Given a line of input from an HTML file, we check whether it contains
        the identifier marking an ingredient. Due to the formatting of our
        HTML files from allrecipes.com, the ingredient name is actually found
        on the following line, so we save the current line to reference on
        the next call, which tells us whether we are on an ingredient line.

        :param line: a line of text from the HTML file as a str
        :yield: a tuple containing each word in the ingredient as well as a
            counter for each word. The counter is not currently being used,
            but is left in for future development. E.g. "chicken breast"
            would yield ("chicken", 1) and ("breast", 1).
        """
        # TODO is there a better way to get the tag?
        if re.search(r'span class="ingredient-name" id="lblIngName"',
                     self.previous_line):
            self.previous_line = line
            line = self.process_text(line)  # text cleanup helper, defined elsewhere
            line_list = set(line.split())
            for word in line_list:
                if word not in self.stopwords:
                    yield (word, 1)
        else:
            self.previous_line = line
            yield ('', 0)
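For reference, this is roughly how the local and EMR runs are launched with mrjob's runner API (the module name, class name, and paths are illustrative, and the EMR run assumes AWS credentials are configured in mrjob.conf):

# Sketch: the local run (over files downloaded from the bucket) versus the
# EMR run reading directly from S3. Module name and paths are placeholders.
from recipe_job import MRIngredientWords

# Local run over the downloaded copy of the bucket:
local_job = MRIngredientWords(args=['-r', 'local', 'recipes/'])
with local_job.make_runner() as runner:
    runner.run()

# EMR run reading the same files directly from S3:
emr_job = MRIngredientWords(args=['-r', 'emr', 's3://my-recipe-bucket/recipes/'])
with emr_job.make_runner() as runner:
    runner.run()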
The problem is that you have a very large number of small files. Add a bootstrap step that copies the files onto the EMR cluster using s3-dist-cp, and aggregate the small files into ~128MB files while copying.

Hadoop does not cope well with lots of small files: each file smaller than a block becomes its own input split, so your ~36,000 files turn into ~36,000 map tasks, each paying the overhead of a separate S3 connection.

Your local run was faster because you had already downloaded the files to your machine by hand.

Once the files have been copied over with S3DistCp, point the job at the aggregated files in HDFS instead of S3.
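As a rough sketch, that aggregation step could be added to a running cluster with boto3; the cluster ID, region, bucket, paths, and the --groupBy pattern below are all placeholders, so check them against your own setup and the S3DistCp documentation:

# Sketch: add an S3DistCp step that copies the recipe files from S3 into
# HDFS, concatenating the small files into ~128MB chunks along the way.
import boto3

emr = boto3.client('emr', region_name='us-east-1')  # region is a placeholder

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # your cluster ID
    Steps=[{
        'Name': 'Aggregate small recipe files into HDFS',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',  # how EMR runs s3-dist-cp as a step
            'Args': [
                's3-dist-cp',
                '--src', 's3://my-recipe-bucket/recipes/',  # placeholder bucket
                '--dest', 'hdfs:///recipes/',
                # group every matching file into one pile, keyed by the capture group
                '--groupBy', '.*(/recipes/).*\\.html',
                '--targetSize', '128',  # split each pile into ~128MB output files
            ],
        },
    }],
)

--groupBy concatenates every file whose name matches the regex into one output group per distinct capture-group value, and --targetSize then splits each group into files of roughly the given size in MB; the MapReduce job can then take hdfs:///recipes/ as its input path.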