With MapReduce is it guaranteed that ALL values with the same key will go to the same reducer?

Question

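I have a MapReduce project I am working on (specifically, I am using Python and the library MrJob, and plan on running it on Amazon's EMR). Here is an example that sums up the problem I am having:

I have thousands of GB of JSON files full of customer data. I need to run daily, weekly, and monthly reports over every customer JSON line/input/object.

So for the map step I currently do: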

def map_step(_, customer_json_object):
    c_uuid = customer_json_object.uuid
    # A record can fall into several time ranges, so it may be yielded more than once.
    if customer_json_object.time in daily_time_range:
        yield "%s-%s" % (DAILY_CONSTANT, c_uuid), customer_json_object
    if customer_json_object.time in weekly_time_range:
        yield "%s-%s" % (WEEKLY_CONSTANT, c_uuid), customer_json_object
    if customer_json_object.time in monthly_time_range:
        yield "%s-%s" % (MONTHLY_CONSTANT, c_uuid), customer_json_object

And then the reducer:

def reducer_step(key, customer_info):
    # Split only on the first "-" so that a UUID containing hyphens stays intact.
    report_type, c_uuid = key.split("-", 1)
    yield None, Create_Report(report_type, customer_info)

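My question is:

Am I guaranteed here that all my data with the same key (meaning here all the data for a specific customer and a specific report type) will be handled by the same reducer? My Create_Report cannot be distributed across multiple processes, so I need all the data required for one report to be handled by a single process.

I'm worried that if there are too many values for one key, they might get spread out across reducers or something. However, from what I've read, this is how it is supposed to work.

Thanks so much!! I only just realized that I need to yield multiple times from the map step, so this is the last piece of the puzzle for me. If this can be figured out it will be a huge win, as I can't scale my little server vertically any further...

In case it isn't clear from the code above, I have thousands of files of JSON lines of customer (or really user, nobody is paying me) data. I want to be able to create reports for this data, and the report code is generated differently depending on whether it is monthly, weekly, or daily. I am actually also de-duping the data before this, but this is my final step, actually producing the output. I really appreciate you taking the time to read this and help out!!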

Answer 1

In MapReduce in general, and in the Python library MrJob in particular, the following holds:

A reducer takes a key and the complete set of values for that key in the current step, and returns zero or more arbitrary (key, value) pairs as output.

From: MrJob documentation - https://pythonhosted.org/mrjob/guides/concepts.html#mapreduce-and-apache-hadoop
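To make the quoted contract concrete, here is a minimal local sketch of the map, shuffle, and reduce phases in plain Python. It is not MrJob or Hadoop code; the record fields `report_types` and `amount`, and the helpers `shuffle` and `run_job`, are hypothetical names chosen for illustration. The point it tries to show is that the grouping step happens between map and reduce, so the reducer is invoked once per key with every value for that key.

```python
from collections import defaultdict


def mapper(record):
    """Emit one (key, value) pair per report type the record belongs to."""
    # "report_types" and "amount" are hypothetical fields for this sketch.
    for report_type in record["report_types"]:
        yield "%s-%s" % (report_type, record["uuid"]), record["amount"]


def shuffle(pairs):
    """Group all values under their key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Called once per key with the complete list of values for that key."""
    yield key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in a single process."""
    pairs = [pair for record in records for pair in mapper(record)]
    output = []
    for key, values in sorted(shuffle(pairs).items()):
        output.extend(reducer(key, values))
    return output
```

In this sketch a record that matches several report types is emitted several times by the mapper, and each resulting key is still reduced exactly once.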

Back to your question:

Am I guaranteed here that all my data with the same key ... will be handled by the same reducer?

Yes. Moreover, all values belonging to the same key are passed to the same invocation of the reducer.
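One way to see why this holds is to look at how a pair is routed to a reducer. The sketch below assumes a hash-style partitioner, where the reducer index is computed from the key alone; as far as I know this is what Hadoop's default partitioner does, but the code here is only an illustration, and `NUM_REDUCERS`, `partition`, and `route` are hypothetical names. It uses `zlib.crc32` because Python's built-in `hash()` for strings varies between processes.

```python
import zlib

NUM_REDUCERS = 4  # hypothetical number of reducers for this sketch


def partition(key, num_reducers=NUM_REDUCERS):
    """Choose a reducer index from the key alone, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers=NUM_REDUCERS):
    """Return {reducer_index: {key: [values]}} for the given (key, value) pairs."""
    reducers = {i: {} for i in range(num_reducers)}
    for key, value in pairs:
        # The value plays no part in the routing decision, only the key does.
        reducers[partition(key, num_reducers)].setdefault(key, []).append(value)
    return reducers
```

Because the index depends only on the key, the number of values for a key cannot cause them to be split across reducers. The flip side is that a key with a very large number of values puts all of that work on one reducer, so Create_Report should consume its values as a stream rather than loading them all into memory.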

python

hadoop

mapreduce

bigdata

mrjob