RDD to multidimensional array

Question

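I am using Spark's Python API and am finding some matrix operations challenging. My RDD is a one-dimensional list of length n (a row vector). Is it possible to reshape it into a matrix/multidimensional array of size sqrt(n) x sqrt(n)?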

For example,

Vec=[1,2,3,4,5,6,7,8,9]

and the desired 3 x 3 output is:

[[1,2,3]
[4,5,6]
[7,8,9]]

Is there an equivalent of numpy's reshape?

Constraint: n is large (> 50 million), which rules out using .collect(). Can this process run on multiple threads?

Answer 1

Something like this should do the trick:

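from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1, 10))
nrow = int(rdd.count() ** 0.5)  # Number of rows (and columns) of the square matrix

rows = (rdd
    .zipWithIndex()  # Attach a position index; assumes the data is already sorted
    .groupBy(lambda xi: xi[1] // nrow)  # Group by row number (integer division)
    # Within each row, order by index (column order) and drop the index
    .mapValues(lambda vals: [x for (x, i) in sorted(vals, key=lambda xi: xi[1])]))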

You can add:

from pyspark.mllib.linalg import DenseVector
rows.mapValues(DenseVector)

if you want proper vectors.
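The row key used for grouping is just each element's position integer-divided by the row length. Here is a minimal plain-Python sketch of that same index arithmetic, without Spark, in case it helps to see it on a small list. Note that `reshape_square` is a hypothetical helper name made up for this illustration (it is not part of Spark or NumPy), and the sketch assumes the input length is a perfect square.

```python
# Plain-Python illustration of the index arithmetic used for grouping.
# `reshape_square` is a hypothetical helper, not a Spark or NumPy function.
from collections import defaultdict
from math import isqrt


def reshape_square(values):
    """Group a flat sequence into the rows of a square matrix by position."""
    values = list(values)
    nrow = isqrt(len(values))
    if nrow * nrow != len(values):
        raise ValueError("length must be a perfect square")

    groups = defaultdict(list)
    for i, x in enumerate(values):        # plays the role of zipWithIndex
        groups[i // nrow].append((i, x))  # plays the role of groupBy on the row number

    # Sort each row by position, drop the position, and emit rows in key order.
    return [[x for _, x in sorted(groups[r])] for r in sorted(groups)]


matrix = reshape_square([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(matrix)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In the Spark version the grouped rows are keyed by row number but are not necessarily returned in row order, which is why the sketch sorts the keys explicitly before building the result.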

python

apache-spark

pyspark