Predetermining the number of partitions of an RDD

Question

1) How can I predetermine the number of RDD partitions that will be created?
2) What factors does the partitioning of data depend on? Is it only the size of the data and the way it is stored (compressed, sequence file, etc.)?

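For simplicity, assume I have a 6 GB file stored in HDFS as a plain text file.

My cluster is an EC2 cluster with the following configuration:

1 master node - m3.xlarge (4 cores, 15 GB RAM)

4 core nodes - m3.xlarge (4 cores, 15 GB RAM each)

Update: What happens if the same content is stored in S3, HBase, or any NoSQL store?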

Answer 1

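Partitioning depends on the type of the source. In your case, since it is an HDFS file, the default number of partitions is the number of input splits, which depends on your Hadoop settings (chiefly the HDFS block size). If all you want is a way to understand how this works, look at the source.

From HadoopRDD.getPartitions: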

val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
val array = new Array[Partition](inputSplits.size)
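
To get a rough number before running a job, the split arithmetic can be reproduced by hand. The sketch below is a simplified model of what Hadoop's older `FileInputFormat.getSplits` does for a single splittable, uncompressed file. It is not Spark's or Hadoop's actual code. The object and method names are hypothetical, and the 128 MB and 64 MB block sizes and `minPartitions = 2` are assumptions; check your own cluster's values.

```scala
object SplitEstimate {
  // Slack factor: the last split is allowed to be up to 10% larger than the others.
  private val SplitSlop = 1.1

  /** Estimate the number of input splits for ONE splittable, uncompressed file.
    * All parameters are in bytes except minPartitions. Assumed, simplified model.
    */
  def estimateSplits(fileSize: Long,
                     blockSize: Long,
                     minPartitions: Int,
                     minSplitSize: Long = 1L): Int = {
    // Target size if the file were divided evenly into minPartitions pieces.
    val goalSize = fileSize / math.max(minPartitions, 1)
    // A split is never larger than a block and never smaller than minSplitSize.
    val splitSize = math.max(minSplitSize, math.min(goalSize, blockSize))

    var remaining = fileSize
    var splits = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      splits += 1
      remaining -= splitSize
    }
    // Whatever is left becomes the final split.
    if (remaining > 0) splits += 1
    splits
  }

  def main(args: Array[String]): Unit = {
    val MB = 1024L * 1024
    val GB = 1024L * MB
    println(estimateSplits(6 * GB, 128 * MB, minPartitions = 2)) // prints 48
    println(estimateSplits(6 * GB, 64 * MB, minPartitions = 2))  // prints 96
  }
}
```

Under these assumptions the 6 GB text file from the question yields 48 partitions with 128 MB blocks, or 96 with 64 MB blocks. Note that the cluster size (cores, RAM) does not appear in the calculation. A non-splittable compressed file such as gzip would typically be read as a single partition, which this model does not cover.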

amazon-s3

hdfs

apache-spark