Spark、Scala、DataFrame:创建特征向量
Spark, Scala, DataFrame: create feature vectors
我有一个 DataFrame
如下所示:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
不同类别的数量是 10,我想为每个 userID
创建一个特征向量,并用零填充缺失的类别。
所以输出会是这样的:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
这只是一个说明性示例,实际上我有大约 200,000 个唯一用户 ID 和 300 个唯一类别。
创建特征的最有效方法是什么 DataFrame
?
假设:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
其中 userId
和 frequency
是 bigint
列,对应于 scala.Long
我们正在创建中间映射RDD
:
val catMaps = cats.rdd
.groupBy(_.getAs[Long]("userId"))
.map { case (id, rows) => id -> rows
.map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
.toMap
}
然后按字典顺序收集所有呈现的类别
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
或手动创建
val catNames = cs.broadcast(1 to 10 map {n => s"cat$n"} toArray)
最后,我们将地图转换为具有 0 值的数组以表示不存在的值
import sc.implicits._
val catArrays = catMaps
.map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
.toDF("userId", "feature")
现在 catArrays.show()
打印类似
的内容
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
这可能不是数据帧最优雅的解决方案,因为我对 spark 的这个领域几乎不熟悉。
请注意,您可以手动创建 catNames
来为缺失的 cat3
、cat5
、...
添加零
还要注意,否则catMaps
RDD被操作两次,你可能要.persist()
它
更多DataFrame
中心解决方案:
import org.apache.spark.ml.feature.VectorAssembler
val df = sc.parallelize(Seq(
(1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), (2, "cat4", 6),
(2, "cat9", 2), (2, "cat10", 1), (3, "cat1", 5), (3, "cat7", 16),
(3, "cat8", 2))).toDF("userID", "category", "frequency")
// Create a sorted array of categories
val categories = df
.select($"category")
.distinct.map(_.getString(0))
.collect
.sorted
// Prepare vector assemble
val assembler = new VectorAssembler()
.setInputCols(categories)
.setOutputCol("features")
// Aggregation expressions
val exprs = categories.map(
c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))
val transformed = assembler.transform(
df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
.select($"userID", $"features")
和一个 UDAF 替代方案:
import org.apache.spark.sql.expressions.{
MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.types.{
StructType, ArrayType, DoubleType, IntegerType}
import scala.collection.mutable.WrappedArray
class VectorAggregate (n: Int) extends UserDefinedAggregateFunction {
def inputSchema = new StructType()
.add("i", IntegerType)
.add("v", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = new VectorUDT()
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, Array.fill(n)(0.0))
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val i = input.getInt(0)
val v = input.getDouble(1)
val buff = buffer.getAs[WrappedArray[Double]](0)
buff(i) += v
buffer.update(0, buff)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getAs[WrappedArray[Double]](0)
val buff2 = buffer2.getAs[WrappedArray[Double]](0)
for ((x, i) <- buff2.zipWithIndex) {
buff1(i) += x
}
buffer1.update(0, buff1)
}
def evaluate(buffer: Row) = Vectors.dense(
buffer.getAs[Seq[Double]](0).toArray)
}
使用示例:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("category_idx")
.fit(df)
val indexed = indexer.transform(df)
.withColumn("category_idx", $"category_idx".cast("integer"))
.withColumn("frequency", $"frequency".cast("double"))
val n = indexer.labels.size + 1
val transformed = indexed
.groupBy($"userID")
.agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))
transformed.show
// +------+--------------------+
// |userID| vec|
// +------+--------------------+
// | 1|[1.0,5.0,0.0,3.0,...|
// | 2|[0.0,2.0,0.0,0.0,...|
// | 3|[5.0,0.0,16.0,0.0...|
// +------+--------------------+
在这种情况下,值的顺序由 indexer.labels
:
定义
indexer.labels
// Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)
实际上我更喜欢Odomontois的解决方案,所以这些主要供参考。
鉴于您的意见:
val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
(2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
(3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
.toDF("userID", "category", "frequency")
df.show
+------+--------+---------+
|userID|category|frequency|
+------+--------+---------+
| 1| cat1| 1|
| 1| cat2| 3|
| 1| cat9| 5|
| 2| cat4| 6|
| 2| cat9| 2|
| 2| cat10| 1|
| 3| cat1| 5|
| 3| cat7| 16|
| 3| cat8| 2|
+------+--------+---------+
就运行:
val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
val dfZeros = pivoted.na.fill(0)
dzZeros.show
+------+----+-----+----+----+----+----+----+
|userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
+------+----+-----+----+----+----+----+----+
| 1| 1.0| 0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
| 3| 5.0| 0.0| 0.0| 0.0|16.0| 2.0| 0.0|
| 2| 0.0| 1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
+------+----+-----+----+----+----+----+----+
最后,使用VectorAssembler创建一个org.apache.spark.ml.linalg.Vector
注意:我还没有检查这方面的表现...
编辑:可能更复杂,但可能更高效!
def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
(data: Seq[Row]) => {
val indices = data.map(_.getDouble(0).toInt).toArray
val values = data.map(_.getInt(1).toDouble).toArray
Vectors.sparse(size, indices, values)
}
}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
val indexerModel = indexer.fit(df)
val totalCategories = indexerModel.labels.size
val dataWithIndices = indexerModel.transform(df)
val data = dataWithIndices.groupBy("userId").agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
dataWithFeatures.show(false)
+------+--------------------------+
|userId|features |
+------+--------------------------+
|1 |(7,[0,1,3],[1.0,5.0,3.0]) |
|3 |(7,[0,2,4],[5.0,16.0,2.0])|
|2 |(7,[1,5,6],[2.0,6.0,1.0]) |
+------+--------------------------+
注意:StringIndexer 将按频率对类别进行排序 => 最常见的类别将位于 indexerModel.labels
中的索引=0。如果您愿意,可以随意使用自己的映射并将其直接传递给 toSparseVectorUdf
.
我有一个 DataFrame
如下所示:
userID, category, frequency
1,cat1,1
1,cat2,3
1,cat9,5
2,cat4,6
2,cat9,2
2,cat10,1
3,cat1,5
3,cat7,16
3,cat8,2
不同类别的数量是 10,我想为每个 userID
创建一个特征向量,并用零填充缺失的类别。
所以输出会是这样的:
userID,feature
1,[1,3,0,0,0,0,0,0,5,0]
2,[0,0,0,6,0,0,0,0,2,1]
3,[5,0,0,0,0,0,16,2,0,0]
这只是一个说明性示例,实际上我有大约 200,000 个唯一用户 ID 和 300 个唯一类别。
创建特征的最有效方法是什么 DataFrame
?
假设:
val cs: SparkContext
val sc: SQLContext
val cats: DataFrame
其中 userId
和 frequency
是 bigint
列,对应于 scala.Long
我们正在创建中间映射RDD
:
val catMaps = cats.rdd
.groupBy(_.getAs[Long]("userId"))
.map { case (id, rows) => id -> rows
.map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
.toMap
}
然后按字典顺序收集所有呈现的类别
val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)
或手动创建
val catNames = cs.broadcast(1 to 10 map {n => s"cat$n"} toArray)
最后,我们将地图转换为具有 0 值的数组以表示不存在的值
import sc.implicits._
val catArrays = catMaps
.map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
.toDF("userId", "feature")
现在 catArrays.show()
打印类似
+------+--------------------+
|userId| feature|
+------+--------------------+
| 2|[0, 1, 0, 6, 0, 0...|
| 1|[1, 0, 3, 0, 0, 0...|
| 3|[5, 0, 0, 0, 16, ...|
+------+--------------------+
这可能不是数据帧最优雅的解决方案,因为我对 spark 的这个领域几乎不熟悉。
请注意,您可以手动创建 catNames
来为缺失的 cat3
、cat5
、...
还要注意,否则catMaps
RDD被操作两次,你可能要.persist()
它
更多DataFrame
中心解决方案:
import org.apache.spark.ml.feature.VectorAssembler
val df = sc.parallelize(Seq(
(1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5), (2, "cat4", 6),
(2, "cat9", 2), (2, "cat10", 1), (3, "cat1", 5), (3, "cat7", 16),
(3, "cat8", 2))).toDF("userID", "category", "frequency")
// Create a sorted array of categories
val categories = df
.select($"category")
.distinct.map(_.getString(0))
.collect
.sorted
// Prepare vector assemble
val assembler = new VectorAssembler()
.setInputCols(categories)
.setOutputCol("features")
// Aggregation expressions
val exprs = categories.map(
c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))
val transformed = assembler.transform(
df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
.select($"userID", $"features")
和一个 UDAF 替代方案:
import org.apache.spark.sql.expressions.{
MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.types.{
StructType, ArrayType, DoubleType, IntegerType}
import scala.collection.mutable.WrappedArray
class VectorAggregate (n: Int) extends UserDefinedAggregateFunction {
def inputSchema = new StructType()
.add("i", IntegerType)
.add("v", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = new VectorUDT()
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, Array.fill(n)(0.0))
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val i = input.getInt(0)
val v = input.getDouble(1)
val buff = buffer.getAs[WrappedArray[Double]](0)
buff(i) += v
buffer.update(0, buff)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getAs[WrappedArray[Double]](0)
val buff2 = buffer2.getAs[WrappedArray[Double]](0)
for ((x, i) <- buff2.zipWithIndex) {
buff1(i) += x
}
buffer1.update(0, buff1)
}
def evaluate(buffer: Row) = Vectors.dense(
buffer.getAs[Seq[Double]](0).toArray)
}
使用示例:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("category_idx")
.fit(df)
val indexed = indexer.transform(df)
.withColumn("category_idx", $"category_idx".cast("integer"))
.withColumn("frequency", $"frequency".cast("double"))
val n = indexer.labels.size + 1
val transformed = indexed
.groupBy($"userID")
.agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))
transformed.show
// +------+--------------------+
// |userID| vec|
// +------+--------------------+
// | 1|[1.0,5.0,0.0,3.0,...|
// | 2|[0.0,2.0,0.0,0.0,...|
// | 3|[5.0,0.0,16.0,0.0...|
// +------+--------------------+
在这种情况下,值的顺序由 indexer.labels
:
indexer.labels
// Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)
实际上我更喜欢Odomontois的解决方案,所以这些主要供参考。
鉴于您的意见:
val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
(2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
(3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
.toDF("userID", "category", "frequency")
df.show
+------+--------+---------+
|userID|category|frequency|
+------+--------+---------+
| 1| cat1| 1|
| 1| cat2| 3|
| 1| cat9| 5|
| 2| cat4| 6|
| 2| cat9| 2|
| 2| cat10| 1|
| 3| cat1| 5|
| 3| cat7| 16|
| 3| cat8| 2|
+------+--------+---------+
就运行:
val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
val dfZeros = pivoted.na.fill(0)
dzZeros.show
+------+----+-----+----+----+----+----+----+
|userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
+------+----+-----+----+----+----+----+----+
| 1| 1.0| 0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
| 3| 5.0| 0.0| 0.0| 0.0|16.0| 2.0| 0.0|
| 2| 0.0| 1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
+------+----+-----+----+----+----+----+----+
最后,使用VectorAssembler创建一个org.apache.spark.ml.linalg.Vector
注意:我还没有检查这方面的表现...
编辑:可能更复杂,但可能更高效!
def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
(data: Seq[Row]) => {
val indices = data.map(_.getDouble(0).toInt).toArray
val values = data.map(_.getInt(1).toDouble).toArray
Vectors.sparse(size, indices, values)
}
}
val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
val indexerModel = indexer.fit(df)
val totalCategories = indexerModel.labels.size
val dataWithIndices = indexerModel.transform(df)
val data = dataWithIndices.groupBy("userId").agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
val dataWithFeatures = data.withColumn("features", toSparseVectorUdf(totalCategories)($"data")).drop("data")
dataWithFeatures.show(false)
+------+--------------------------+
|userId|features |
+------+--------------------------+
|1 |(7,[0,1,3],[1.0,5.0,3.0]) |
|3 |(7,[0,2,4],[5.0,16.0,2.0])|
|2 |(7,[1,5,6],[2.0,6.0,1.0]) |
+------+--------------------------+
注意:StringIndexer 将按频率对类别进行排序 => 最常见的类别将位于 indexerModel.labels
中的索引=0。如果您愿意,可以随意使用自己的映射并将其直接传递给 toSparseVectorUdf
.