Create LabeledPoints from Spark DataFrame & how to pass list of names to VectorAssembler

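I have another question. I am trying to build LabeledPoints from a DataFrame in which both the features and the label are stored as columns. The features are all booleans encoded as 1/0.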


# Using the code from the answer above,
# create a list of feature names from the column names of the DataFrame
df_columns = []
for c in df.columns:
    if c == 'is_item_return': continue  # skip the label column
    df_columns.append(c)

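# Using VectorAssembler for the transformation; only the first 5 column names are used
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler()
assembler.setInputCols(df_columns[0:5])
assembler.setOutputCol('features')

transformed = assembler.transform(df)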

# Mapping also taken from the link above
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

new_df = transformed.select(col('is_item_return'), col("features")).rdd.map(lambda row: LabeledPoint(row.is_item_return, row.features))



Can someone help me understand how to pass the column names of an existing DataFrame as feature names to VectorAssembler?

There is nothing wrong here. What you get is the string representation of a SparseVector, and it reflects your input exactly:

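  • You take the first five columns (assembler.setInputCols(df_columns[0:5])), so the output vector has length 5
  • Because those columns contain no non-zero entries in the example input, the indices and values arrays are empty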
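The "(size,[indices],[values])" notation can be reproduced in plain Python without Spark. This is only a minimal sketch of the format: sparse_repr is a hypothetical helper written for illustration, not part of the Spark API.

```python
def sparse_repr(row):
    """Format a dense row as '(size,[indices],[values])' (hypothetical helper)."""
    # Keep only the positions whose value is non-zero.
    nonzero = [(i, float(v)) for i, v in enumerate(row) if v != 0]
    indices = ",".join(str(i) for i, _ in nonzero)
    values = ",".join(str(v) for _, v in nonzero)
    return "({},[{}],[{}])".format(len(row), indices, values)

all_zero = sparse_repr([0.0, 0.0, 0.0, 0.0, 0.0])
mixed = sparse_repr([0.0, 1.0, 0.0, 1.0, 0.0])
print(all_zero)  # (5,[],[])
print(mixed)     # (5,[1,3],[1.0,1.0])
```

An all-zero row therefore has a size but no indices and no values, which is exactly what "(5,[],[])" says.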

To illustrate this, let's use Scala, which provides convenient toSparse / toDense methods:

import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 0.0))

v.toSparse.toString
// String = (5,[],[])

v.toSparse.toDense.toString
// String = [0.0,0.0,0.0,0.0,0.0]

The same holds in PySpark:

from pyspark.ml.feature import VectorAssembler

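df = sc.parallelize([
    tuple([0.0] * 5),
    tuple([1.0] * 5),
    (1.0, 0.0, 1.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0, 0.0)
]).toDF()

# Assemble every column into a single "features" vector column
features = (VectorAssembler(inputCols=df.columns, outputCol="features")
    .transform(df)
    .select("features"))

features.show(4, False)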

## +---------------------+
## |features             |
## +---------------------+
## |(5,[],[])            |
## |[1.0,1.0,1.0,1.0,1.0]|
## |[1.0,0.0,1.0,0.0,1.0]|
## |(5,[1,3],[1.0,1.0])  |
## +---------------------+


features.rdd.flatMap(lambda x: x).map(type).collect()

## [pyspark.mllib.linalg.SparseVector,
##  pyspark.mllib.linalg.DenseVector,
##  pyspark.mllib.linalg.DenseVector,
##  pyspark.mllib.linalg.SparseVector]
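
Why some rows come back dense and others sparse can be sketched without Spark. The rule below is an assumption: to my understanding, the assembled vector is stored in whichever form is more compact, using a heuristic of roughly the form 1.5 * (nnz + 1) < size, where nnz is the number of non-zero entries. prefers_sparse is a hypothetical helper, and the exact threshold may differ between Spark versions.

```python
def prefers_sparse(row):
    """Return True if the assumed heuristic would pick a sparse vector."""
    size = len(row)
    nnz = sum(1 for v in row if v != 0)  # number of non-zero entries
    # Assumed rule: sparse wins when 1.5 * (nnz + 1) < size.
    return 1.5 * (nnz + 1.0) < size

# The same four rows as in the PySpark example above.
rows = [
    (0.0, 0.0, 0.0, 0.0, 0.0),
    (1.0, 1.0, 1.0, 1.0, 1.0),
    (1.0, 0.0, 1.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0, 0.0),
]
choices = ["sparse" if prefers_sparse(r) else "dense" for r in rows]
print(choices)  # ['sparse', 'dense', 'dense', 'sparse']
```

Under that assumption the sketch matches the types collected above: rows with zero or two non-zero entries are sparse, and rows with three or five are dense. Either way the vectors hold the same values, so they can be passed to LabeledPoint unchanged.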