工作如何在 Spark 中分配

How work gets distributed in Spark

星火版本：1.4.0 卡桑德拉版本：2.1.8

我正在使用 datastax Spark Cassandra 连接器桥接 Spark 和 Cassandra。我在 Spark 运行 6 个不同的工作人员中有 6 个节点。我有 2 个 Cassandra 节点协助这个。

我尝试了一个示例应用程序来计算列族中的行数（CassandraUtil.javaFunctions(sc).cassandraTable("keyspace","columnfamily").count() ).

现在，当我将这个单一作业分派给 master 时，作业运行在 Spark 集群的 2 个工作节点中（从事件时间轴获得）。

问题

我派遣了一份工作。为什么是由两个工人完成的？是不是一个工人在这里当主人？
我发现一名工人的反序列化时间非常长。其他工人完成工作的速度相当快（1 人用了 40 秒，2 人用了 1 秒）。你能解释一下吗？
两个工人似乎都与Cassandra建立了联系并返回了结果。所以，在我看来，两者都在做同样的工作。你能解释一下吗？
我仍然想知道 RDD 的实现在 Cassandra 的这个分布式领域中的位置。有人可以对此有所了解吗？多个工作人员如何知道他们必须在 Cassandra 的哪个分区上工作，如果它可以，比如说，在 6 个工作人员之间拆分 10k 个分区？是不是像，提取全部由一个工人完成，而处理由其中的 6 个工人完成？即使在那种情况下，执行逻辑在所有工作人员中保持不变（从 Cassandra 获取并处理）。 Spark 是如何做到这一点的？
想知道将 Spark 与 Cassandra 结合使用的真正优势。是在内存管理层面还是有一些其他的优势？

编辑

我添加了运行的图片。我只有 10 个不同的分区。这是一个简单的计数操作。

我想我的问题仍然是个谜。

如果您看到提供的附件，我想您会有所了解。这是提交给我的 spark master 的一份工作。想知道它如何运行在两个不同的执行者中。两个执行者都返回相同数量的字节。所以，这表明两者都已从 cassandra 获取了所有 10 个分区。如果这是它发生的方式，那么 spark 比 cassandra 能为我提供什么？或者，我是否必须以其他方式获取它，以便由两个不同的工作人员获取十个分区？

我建议您花几个小时阅读 Spark 和 C*。我在这个 post.

的底部挑选了一些推荐的 material

现在让我回答您的问题：

I dispatched a single job. Why it was done by two workers? Is it like one worker acts like a master here?

可能与资源可用性或作业中的分区数量有关（可能是后者）。

正如 Russ 所说 "Increase the parallelism of your job. Try increasing the number of partitions in your job. By splitting the work into smaller sets of data less information will have to be resident in memory at a given time. For a Spark Cassandra Connector job this would mean decreasing the split size variable."

要在 1.2 中调整它，请使用：

spark.cassandra.input.split.size spark.cassandra.output.batch.size.rows spark.cassandra.output.batch.size.bytes

在较新的版本中，您还拥有： spark.cassandra.output.throughput_mb_per_sec

I found the deserialisation time to be very high in one worker. Other worker completed the job pretty fast( 1 took 40 seconds and 2 took 1 second). Can you throw some light on this?

从 Kay who actually added the feature 到网络 ui:

“反序列化任务的时间相对较长为短期工作分配时间，了解时间高的时间会有所帮助开发人员意识到他们应该尝试减小闭包大小（例如，通过包含任务描述中的数据较少）。"

Both the workers seems to have established a connection with Cassandra and has returned a result. So , in my view, both are doing the same job. Can you throw some light on this?

Spark 并行工作。因为这是一种分布式计算范例，您可以通过启动并行工作的执行程序来利用多个节点和多个内核。两个执行程序都会从 C* 中提取数据，但它们会根据分区提取不同的数据。

有关详细信息，请参阅一些介绍视频。

I am still wondering where the implementation of RDD will fit in this distributed realm with Cassandra . Can someone throw some light on this? How does multiple workers know which partition of Cassandra they have to work on , if it can , say ,split 10k partitions among 6 workers? Is it like ,fetching is all done by one worker and processing is done by 6 of them? Even in that case, execution logic remains the same in all workers(fetch from Cassandra and process). How does Spark do this?

每个人都将根据分区获取和处理自己的数据。

要获取有关如何对作业进行分区的信息，请使用：

rdd.partitions

如果您将 Spark 和 Cassandra 放在一起，如 DSE 中的情况，您将获得数据局部性的优势（无需将数据从 c* 流式传输到 spark worker）。

Would like to know the real advantage of using Spark with Cassandra. Is it at memory management level or it has some other advantages?

这里可能太多了，看推荐的reading/viewing。大人物是 sql 用于批处理和流分析的样式查询（连接、聚合、groupby 等）+ 使用 MLLIB 的精美统计建模、使用 graphx 的分析图等

这里有一些不错的 material 可以帮助您快速上手：

这是 Russ 关于 Spark 和 C* 的可能性的高级演示： http://www.slideshare.net/planetcassandra/escape-from-hadoop

OReily 网络研讨会与来自 DataBricks 的 Sameer 关于 DSE 如何与 Spark 集成： http://www.oreilly.com/pub/e/3234

连接器如何读取数据： https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data

关于故障排除 spark 的关键 post 将在您真正尝试让东西正常工作时提供帮助。这些将回答您的大部分 opps/perf 问题： http://www.datastax.com/dev/blog/common-spark-troubleshooting

https://databricks.com/blog/2015/06/16/zen-and-the-art-of-spark-maintenance-with-cassandra.html

来自 Sandy 的两个相似且有价值的 posts（非特定于 c*）： http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

工作如何在 Spark 中分配

How work gets distributed in Spark

cassandra

cassandra-2.0

apache-spark

spark-cassandra-connector