MatrixEntry not iterable when processing CoordinateMatrix... pyspark MLlib

Question

I'm trying to execute this line on a CoordinateMatrix...

test = test.entries.map(lambda (i, j, v): (j, (i, v)))

The equivalent in Scala seems to work, but it fails in pyspark. The error I get when the line executes...

'MatrixEntry' object is not iterable

And to confirm that I am working with a CoordinateMatrix...

>>> test = test_coord.entries
>>> test.first()
MatrixEntry(0, 0, 7.0)

Does anyone know what might be going on?

Answer 1

Assuming test is a CoordinateMatrix, then:

test.entries.map(lambda e: (e.j, (e.i, e.value)))

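Side note: the lambda (i, j, v): ... form makes Python 2 unpack its argument by iterating over it, and a MatrixEntry is not iterable, which is what raises the error in the question. In Python 3, tuple parameters were removed from function signatures altogether (PEP 3113), so the same lambda is a SyntaxError there. Either way, read the fields through the attributes i, j and value instead of unpacking.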
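The failure can be reproduced without Spark. The sketch below uses FakeMatrixEntry, a hypothetical stand-in that is only assumed to resemble the real MatrixEntry in the two ways that matter here: it exposes i, j and value as attributes, and it does not support iteration (as the error message in the question indicates).

```python
class FakeMatrixEntry:
    """Hypothetical stand-in for MatrixEntry: plain attributes, no __iter__."""

    def __init__(self, i, j, value):
        self.i = int(i)
        self.j = int(j)
        self.value = float(value)


entry = FakeMatrixEntry(0, 0, 7.0)

# Unpacking requires an iterable, so this raises TypeError,
# much like the tuple-unpacking lambda in the question.
try:
    i, j, v = entry
    unpack_error = None
except TypeError as exc:
    unpack_error = str(exc)

# Attribute access involves no iteration, so it works.
keyed = (entry.j, (entry.i, entry.value))

print(unpack_error)
print(keyed)
```

The exact wording of the TypeError differs between Python versions, but in each case the cause is that the object cannot be iterated.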

Example:

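from pyspark.mllib.linalg.distributed import CoordinateMatrix

# sc is an existing SparkContext, e.g. the one the pyspark shell provides
test = CoordinateMatrix(sc.parallelize([(1,2,3), (4,5,6)]))
test.entries.collect()
# [MatrixEntry(1, 2, 3.0), MatrixEntry(4, 5, 6.0)]
test.entries.map(lambda e: (e.j, (e.i, e.value))).collect()
# [(2L, (1L, 3.0)), (5L, (4L, 6.0))]
# (Python 2 output; Python 3 prints the integers without the L suffix)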
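If the mapping gets reused, a named function may read better than a lambda and behaves the same under Python 2 and 3. This is only a sketch: key_by_column is a hypothetical helper name, and SimpleNamespace objects stand in for MatrixEntry so the snippet runs without a Spark installation.

```python
from types import SimpleNamespace


def key_by_column(entry):
    """Return (j, (i, value)) for any object exposing i, j and value attributes."""
    return (entry.j, (entry.i, entry.value))


# Stand-ins for MatrixEntry objects (assumption: only the attributes matter here).
sample_entries = [
    SimpleNamespace(i=1, j=2, value=3.0),
    SimpleNamespace(i=4, j=5, value=6.0),
]

keyed = [key_by_column(e) for e in sample_entries]
print(keyed)  # [(2, (1, 3.0)), (5, (4, 6.0))]
```

With a real CoordinateMatrix the helper would be passed straight to map, as in test.entries.map(key_by_column).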

pyspark

apache-spark-mllib