Spark: How to run logistic regression using only some features from LabeledPoint?
I have a LabeledPoint on which I want to run logistic regression:
Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
MapPartitionsRDD[3335] at map at <console>:44
using this code:
val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)
My problem is that I don't want to use all of the features in the LabeledPoint, only some of them. I have a list of the features I do not want to use, for example:
LoF=List(223244,334453...
How can I select, from the LabeledPoint, only the features I want to use in the logistic regression?
Feature selection allows you to pick the most relevant features for model construction. It reduces the size of the vector space and, in turn, the complexity of any subsequent operations on those vectors. The number of features to select can be tuned using a held-out validation set.
One way to accomplish what you are looking for is to use ElementwiseProduct.
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
So, if we set the weights of the features we want to keep to 1.0 and the weights of the others to 0.0, the ElementwiseProduct of the original vector and the 0-1 weights vector will select only the features we need:
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Creating a dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))
data.toDF.show
// +-----+--------------------+
// |label| features|
// +-----+--------------------+
// | 1.0|[1.0,0.0,3.0,5.0,...|
// | 1.0|[4.0,5.0,6.0,1.0,...|
// | 0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+
// You'll need to know how many features you have; I have used 5 for this example
val numFeatures = 5
// The indices represent the features we want to keep
// Note: indices start at 0, so here you are actually keeping the 4th and 5th features
val indices = List(3, 4).toArray
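// If, as in the question, you instead have a list of features to EXCLUDE,
// you can derive the kept indices from it. A sketch, assuming a hypothetical
// LoF that holds 0-based column indices for this 5-feature example:
val LoF = List(0, 1, 2) // hypothetical exclusion list
val keptIndices = (0 until numFeatures).filterNot(LoF.contains).toArray // == Array(3, 4) here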
// Now we can create our weights vector
val weights = Array.fill[Double](indices.size)(1.0)
// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)
// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)
// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))
transformedData.toDF.show
// +-----+-------------------+
// |label| features|
// +-----+-------------------+
// | 1.0|(5,[3,4],[5.0,1.0])|
// | 1.0|(5,[3,4],[1.0,2.0])|
// | 0.0| (5,[4],[2.0])|
// +-----+-------------------+
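You can then feed transformedData straight back into your original training code. A minimal sketch, reusing the split and iteration settings from your question:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// Split the transformed data and train exactly as in your original snippet
val splits = transformedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)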
Notes:
- You'll notice that I used the sparse vector representation for space optimization.
- The resulting features are sparse vectors.
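One caveat: ElementwiseProduct only zeroes the unwanted features out, so the vectors keep their original dimension (5 here). If you want the dimension to actually shrink to the number of kept features, a possible sketch (my own variant, not part of the transformer approach above) is to rebuild the vectors directly:

// Build smaller dense vectors containing only the kept columns
val reducedData = data.map { lp =>
  LabeledPoint(lp.label, Vectors.dense(indices.map(i => lp.features(i))))
}
// The resulting vectors now have indices.length (= 2) dimensions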