How to estimate MapReduce job time

I have a MapReduce program, and this is how long it takes when run on 1% of the dataset:

Job Counters
    Launched map tasks=3
    Launched reduce tasks=45
    Data-local map tasks=1
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=29338
    Total time spent by all reduces in occupied slots (ms)=200225
    Total time spent by all map tasks (ms)=29338
    Total time spent by all reduce tasks (ms)=200225
    Total vcore-seconds taken by all map tasks=29338
    Total vcore-seconds taken by all reduce tasks=200225
    Total megabyte-seconds taken by all map tasks=30042112
    Total megabyte-seconds taken by all reduce tasks=205030400

How can I extrapolate from this to estimate how long analyzing 100% of the data will take? My reasoning was that it would take 100 times longer, since the 1% sample is one block, but when run on 100% it actually took 134 times longer.

Time for 100% of the data:

Job Counters
    Launched map tasks=2113
    Launched reduce tasks=45
    Data-local map tasks=1996
    Rack-local map tasks=117
    Total time spent by all maps in occupied slots (ms)=26800451
    Total time spent by all reduces in occupied slots (ms)=3607607
    Total time spent by all map tasks (ms)=26800451
    Total time spent by all reduce tasks (ms)=3607607
    Total vcore-seconds taken by all map tasks=26800451
    Total vcore-seconds taken by all reduce tasks=3607607
    Total megabyte-seconds taken by all map tasks=27443661824
    Total megabyte-seconds taken by all reduce tasks=3694189568

Predicting MapReduce performance from its behavior on a small fraction of the data is not easy. If you look at the logs of the 1% run, it used 45 reducers. The same number of reducers is still used for 100% of the data. That means the time the reducers spend on the shuffle-and-sort phase and on processing the full output does not scale linearly.
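As a quick sanity check, the counters posted above already show this non-linearity. Note these are cumulative task-time counters, not wall-clock time, so this is only a rough comparison:

```python
# Cumulative task-time counters from the two runs posted above (ms).
map_ms_1pct, reduce_ms_1pct = 29_338, 200_225
map_ms_full, reduce_ms_full = 26_800_451, 3_607_607

small = map_ms_1pct + reduce_ms_1pct
full = map_ms_full + reduce_ms_full

# Naive linear extrapolation: 100x the 1% run.
print(small * 100)             # 22,956,300 ms predicted
print(full)                    # 30,408,058 ms actual
print(round(full / small, 1))  # ~132.5x, close to the ~134x observed

# The two phases scale very differently, which is why no single
# linear factor can work:
print(round(map_ms_full / map_ms_1pct))        # maps grew ~914x
print(round(reduce_ms_full / reduce_ms_1pct))  # reduces grew only ~18x
```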

Some mathematical models have been developed to predict MapReduce performance.

Here is one research paper that offers more insight into MapReduce performance:

http://personal.denison.edu/~bressoud/graybressoudmcurcsm2012.pdf

Hope this information helps.

As mentioned before, predicting the runtime of a MapReduce job is not trivial. The problem is that a job's execution time is defined by the completion time of the last parallel task. A task's execution time depends on the hardware it runs on, concurrent workloads, data skew, and so on.
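To illustrate that last point, here is a minimal sketch (with hypothetical task times) of why the slowest task sets the job's wall-clock time:

```python
import heapq

def job_wall_time(task_times_ms, slots):
    """Greedy simulation: assign each task to the next free slot.
    The job finishes when the LAST task finishes, so one straggler
    (skewed data, a slow node) dominates the wall-clock time."""
    free_at = [0] * slots  # next-free time for each slot
    heapq.heapify(free_at)
    end = 0
    for t in sorted(task_times_ms, reverse=True):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + t)
        end = max(end, start + t)
    return end

# 45 reduce tasks on 45 slots; one skewed key makes one reducer 10x slower.
even = [60_000] * 45
skewed = [60_000] * 44 + [600_000]
print(job_wall_time(even, slots=45))    # 60,000 ms
print(job_wall_time(skewed, slots=45))  # 600,000 ms: the straggler alone sets the job time
```

Doubling the input here would not simply double the wall-clock time; it depends on how the extra work distributes across tasks and which task ends up slowest.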

The Starfish project from Duke University may be worth a look. It includes a performance model for Hadoop jobs, can tune job configurations, and has some nice visualizations that make debugging easier.