PySpark: partitioning data using partitionBy

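I understand that the partitionBy function partitions my data. If I use rdd.partitionBy(100), it will partition my data by key into 100 parts, i.e. data associated with similar keys will be grouped together.

Is my understanding correct?
Is it advisable to have the number of partitions equal to the number of available cores? Does that make processing more efficient?
What if my data is not in key-value format? Can I still use this function?
Say my data is serial_number_of_student, student_name. In this case, can I partition my data by student_name instead of serial_number?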

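Not exactly. Spark, including PySpark, uses hash partitioning by default. Apart from identical keys, there is no practical similarity between keys assigned to a single partition.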
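To make the hash-partitioning point concrete, here is a minimal pure-Python sketch that mimics how keys end up in partitions. It assumes the partition index is `hash(key) % num_partitions`, which matches PySpark's default behaviour for integer keys; `assign_partition` and `simulate_partition_by` are hypothetical helpers, not Spark APIs, and the sample pairs are made up.

```python
from collections import defaultdict


def assign_partition(key, num_partitions):
    """Mimic hash partitioning: partition index = hash(key) mod partition count."""
    return hash(key) % num_partitions


def simulate_partition_by(pairs, num_partitions):
    """Group (key, value) pairs by the partition their key hashes to."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[assign_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


# Small non-negative integers hash to themselves in CPython,
# so this layout is deterministic.
pairs = [(1, "a"), (2, "b"), (5, "c"), (1, "d"), (6, "e")]
layout = simulate_partition_by(pairs, num_partitions=4)

for index, content in sorted(layout.items()):
    print(index, content)
```

Both records with key 1 share a partition, but so does key 5, while the neighbouring key 2 lands elsewhere: identical keys are co-located, similar keys are not.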
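As for the number of partitions, there is no simple answer here. It all depends on the amount of data and the available resources. Too many or too few partitions will degrade performance.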

Some resources claim the number of partitions should be around twice the number of available cores. On the other hand, a single partition typically shouldn't contain more than 128MB, and a single shuffle block cannot be larger than 2GB (see SPARK-6235).
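The two rules of thumb can be combined into a rough starting estimate. The sketch below is only a heuristic under stated assumptions: `suggest_num_partitions` is a hypothetical helper, the 128 MB limit is interpreted as 128 * 1024**2 bytes, and the input sizes are invented for illustration. Real workloads still need measuring.

```python
import math

# Guideline from the text, interpreted here as 128 MiB (an assumption).
MAX_PARTITION_BYTES = 128 * 1024 ** 2


def suggest_num_partitions(total_bytes, num_cores):
    """Return the larger of: twice the core count, or enough
    partitions to keep each one under the size guideline."""
    by_cores = 2 * num_cores
    by_size = math.ceil(total_bytes / MAX_PARTITION_BYTES)
    return max(by_cores, by_size, 1)


# 1 GiB on 8 cores: the core-based rule dominates.
small_job = suggest_num_partitions(total_bytes=1 * 1024 ** 3, num_cores=8)

# 10 GiB on 8 cores: the size-based rule dominates.
large_job = suggest_num_partitions(total_bytes=10 * 1024 ** 3, num_cores=8)

print(small_job, large_job)
```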

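Finally, you have to correct for potential data skew. If some keys are overrepresented in your dataset, it can result in suboptimal resource usage and potential failures.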
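One way to see skew before it causes trouble is to look at how many records each partition would receive. The following is a pure-Python sketch with the same `hash(key) % num_partitions` assumption as hash partitioning on integer keys; `partition_sizes` and `skew_ratio` are hypothetical helpers and the key distribution is invented.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records each partition would receive."""
    counts = Counter(hash(key) % num_partitions for key in keys)
    return [counts.get(index, 0) for index in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition relative to the mean; 1.0 means perfectly balanced."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Key 0 is heavily overrepresented: 90 of 100 records.
keys = [0] * 90 + [1, 2, 3, 5, 6, 7, 9, 10, 11, 13]
sizes = partition_sizes(keys, num_partitions=4)
ratio = skew_ratio(sizes)

print(sizes, ratio)
```

On a real RDD, `rdd.glom().map(len).collect()` returns the actual per-partition record counts, which can be fed to the same ratio.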
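For data that is not in key-value format: no, or at least not directly. You can use the keyBy method to convert an RDD to the required format. Moreover, any Python object can be treated as a key-value pair as long as it implements the required methods that make it behave like an Iterable of length two.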
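As a sketch of the keyBy route, the block below pairs each record with a key derived from it. `key_by` is a hypothetical pure-Python stand-in that runs without Spark; `partition_by_name` shows the same idea with the RDD methods `keyBy` and `partitionBy`, and assumes it is handed a PySpark RDD of `(serial_number, student_name)` tuples. The student records are made up.

```python
def key_by(records, key_func):
    """Pure-Python stand-in for RDD.keyBy: pair each record with key_func(record)."""
    return [(key_func(record), record) for record in records]


def partition_by_name(rdd, num_partitions):
    """Same idea on a real RDD. Assumes `rdd` is a PySpark RDD whose
    elements are (serial_number, student_name) tuples."""
    return rdd.keyBy(lambda record: record[1]).partitionBy(num_partitions)


# Hypothetical (serial_number, student_name) records.
students = [(1, "alice"), (2, "bob"), (3, "alice")]
keyed = key_by(students, lambda record: record[1])

print(keyed)
```

After keying, every element is a `(student_name, original_record)` pair, which is the shape partitionBy expects.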
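Whether you can partition by student_name depends on the type. As long as the key is hashable*, you can. Typically this means it has to be an immutable structure, and all the values it contains have to be immutable as well. For example, a list is not a valid key, but a tuple of integers is.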

* To quote the Python glossary: An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.
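A quick way to check whether a candidate key qualifies is to try hashing it. This is a small sketch; `is_hashable` is a hypothetical helper and the candidate values are only examples.

```python
def is_hashable(obj):
    """Return True if Python's built-in hash() accepts the object."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


candidates = {
    "string": "alice",
    "tuple of ints": (1, 2),
    "list": [1, 2],
    "tuple containing a list": (1, [2]),
}
results = {label: is_hashable(value) for label, value in candidates.items()}

for label, ok in results.items():
    print(f"{label}: {'usable' if ok else 'not usable'} as a key")
```

Note that a tuple is only hashable when everything inside it is hashable too, which is why the last candidate fails.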

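I recently used partitionBy. What I did was restructure my data so that all the records I wanted in the same partition had the same key, which in turn was a value from the data. My data was a list of dictionaries, which I converted into tuples keyed on a value from each dictionary. Initially, partitionBy was not keeping identical keys in the same partition. Then I realized the keys were strings, so I cast them to int, but the problem persisted. The numbers were very large. I then mapped these numbers to small numeric values, and it worked. So my takeaway was that the keys need to be small integers.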
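The remapping described in this answer can be sketched as follows. This only illustrates the workaround the answer reports, not a general requirement; `build_dense_index` is a hypothetical helper, and the `group` field and record values are invented.

```python
def build_dense_index(keys):
    """Map each distinct key to a small integer 0..k-1.
    Sorting keeps the mapping deterministic across runs."""
    return {key: index for index, key in enumerate(sorted(set(keys)))}


# Hypothetical records: dictionaries carrying a very large integer identifier.
records = [
    {"group": 9007199254740993, "value": "a"},
    {"group": 1234567890123456789, "value": "b"},
    {"group": 9007199254740993, "value": "c"},
]

index = build_dense_index(record["group"] for record in records)

# Convert each dictionary into a (small_int_key, record) tuple.
pairs = [(index[record["group"]], record) for record in records]

print([key for key, _ in pairs])
```

Records that shared a large identifier still share a key after remapping, so they would still be routed to the same partition.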

Tags: python, partitioning, apache-spark, rdd, pyspark