Spark MLlib: apply a function to all the elements of a RowMatrix
I have a RowMatrix xw:
scala> xw
res109: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@8e74950
and I would like to apply a function to each of its elements:
f(x)=exp(-x*x)
The type of the matrix's rows can be seen from:
scala> xw.rows.first
res110: org.apache.spark.mllib.linalg.Vector = [0.008930720313311474,0.017169380001300985,-0.013414238595719104,0.02239106636801034,0.023009502628798143,0.02891937604244297,0.03378470969100948,0.03644030110678057,0.0031586143217048825,0.011230244437457062,0.00477455053405408,0.020251682490519785,-0.005429788421130285,0.011578489275815267,0.0019301805575977788,0.022513736483645713,0.009475039307158668,0.019457912132044935,0.019209006632742498,-0.029811133879879596]
My main problem is that I cannot use map on a Vector:
scala> xw.rows.map(row => row.map(e => breeze.numerics.exp(e)))
<console>:44: error: value map is not a member of org.apache.spark.mllib.linalg.Vector
xw.rows.map(row => row.map(e => breeze.numerics.exp(e)))
^
scala>
How can I solve this?
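This assumes you know you actually have a DenseVector (which seems to be the case). You can call toArray on the vector, which gives you an Array that has map, and then convert back to a DenseVector with Vectors.dense: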
xw.rows.map{row => Vectors.dense(row.toArray.map{e => breeze.numerics.exp(e)})}
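The line above mirrors the attempt in the question, which applies exp(e) rather than the stated f(x)=exp(-x*x), and it returns an RDD of vectors rather than a RowMatrix. A minimal sketch of the full round trip might look like the following. The names ElementwiseMap, f, mapRow and mapMatrix are made up for illustration, and it uses scala.math.exp instead of Breeze only to keep the example free of extra imports.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object ElementwiseMap {
  // f(x) = exp(-x*x), the function stated in the question
  def f(x: Double): Double = math.exp(-x * x)

  // Apply f to every element of one row; toArray is defined on every mllib Vector
  def mapRow(row: Vector): Vector = Vectors.dense(row.toArray.map(f))

  // Wrap the transformed rows in a new RowMatrix; the input matrix is left unchanged
  def mapMatrix(m: RowMatrix): RowMatrix = new RowMatrix(m.rows.map(mapRow))
}
```

With the matrix from the question, this would be called as val fxw = ElementwiseMap.mapMatrix(xw).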
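You could do this on a SparseVector as well; it would be mathematically correct, but the conversion to an array could be extremely inefficient. Another option is to call row.copy and then use foreachActive, which makes sense for both dense and sparse vectors. But copy might not be implemented for the particular Vector class you are using, and you cannot mutate the data if you do not know the type of the vector. If you really need to support both sparse and dense vectors, I would do something like: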
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}

xw.rows.map{
  case denseVec: DenseVector =>
    Vectors.dense(denseVec.toArray.map{e => breeze.numerics.exp(e)})
  case sparseVec: SparseVector =>
    //we only need to update values of the sparse vector -- the indices remain the same
    //(note: implicit zeros stay 0, which matches f only when f(0) == 0; exp(0) is 1)
    val newValues: Array[Double] = sparseVec.values.map{e => breeze.numerics.exp(e)}
    Vectors.sparse(sparseVec.size, sparseVec.indices, newValues)
}
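One caveat with the sparse branch: it only touches the stored values, so entries that are implicitly zero stay zero. That equals applying the function to every element only when f(0) = 0. For exp, and for the question's f(x)=exp(-x*x), f(0) = 1, so the true result is dense. A hedged sketch of a helper that keeps the sparse representation only when it is safe could look like this; SparseAwareMap and mapElements are hypothetical names, not part of MLlib.

```scala
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

object SparseAwareMap {
  // Apply f to every element of v, including the implicit zeros of a sparse vector
  def mapElements(v: Vector)(f: Double => Double): Vector = v match {
    // f maps 0 to 0, so implicit zeros stay zero and only stored values change
    case sv: SparseVector if f(0.0) == 0.0 =>
      Vectors.sparse(sv.size, sv.indices, sv.values.map(f))
    // otherwise every element changes, so materialize the full array
    case other =>
      Vectors.dense(other.toArray.map(f))
  }
}
```

It would be used as xw.rows.map(row => SparseAwareMap.mapElements(row)(x => math.exp(-x * x))). The dense fallback can be expensive for very wide sparse rows, but in that case the output is dense anyway.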