Normalize column with Spark
I have a data file with three columns, and I want to normalize the last column in order to apply ALS with ML (Spark and Scala). How can I do it?

Here is an extract of my Dataframe:
import org.apache.spark.sql.types.{IntegerType, FloatType}

val view_df = spark.createDataFrame(view_RDD, viewSchema)
val viewdd = view_df
  .withColumn("userIdTemp", view_df("userId").cast(IntegerType)).drop("userId")
  .withColumnRenamed("userIdTemp", "userId")
  .withColumn("productIdTemp", view_df("productId").cast(IntegerType)).drop("productId")
  .withColumnRenamed("productIdTemp", "productId")
  .withColumn("viewTemp", view_df("view").cast(FloatType)).drop("view")
  .withColumnRenamed("viewTemp", "view")
Using a StandardScaler is usually what you want to do when there is any scaling/normalization to be done. However, in this case there is only a single column to scale, and it is of Float type rather than Vector. Since the StandardScaler only works on Vectors, a VectorAssembler could be applied first, but the resulting Vector would then need to be converted back to a Float after the scaling.
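For reference, that pipeline route would look roughly like the sketch below. It assumes a Spark session and the `viewdd` DataFrame from the question; the column names `viewVec` and `viewScaledVec` are made up for illustration.

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Wrap the single "view" column in a Vector column, as StandardScaler requires
val assembler = new VectorAssembler()
  .setInputCols(Array("view"))
  .setOutputCol("viewVec")
val assembled = assembler.transform(viewdd)

// Fit and apply the scaler (subtract mean, divide by standard deviation)
val scaler = new StandardScaler()
  .setInputCol("viewVec")
  .setOutputCol("viewScaledVec")
  .setWithMean(true)
  .setWithStd(true)
val scaled = scaler.fit(assembled).transform(assembled)

// Unwrap the 1-element Vector back to a plain Float
val firstElem = udf((v: Vector) => v(0).toFloat)
val result = scaled.withColumn("view_scaled", firstElem($"viewScaledVec"))
```

This works, but it is a lot of machinery for one column, which is why the manual approach below is simpler here.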
In this case, it is easier to simply do it yourself. First get the mean and standard deviation of the column, then perform the scaling. It can be done on the view column like this:
import spark.implicits._  // for .as[...] and the $-column syntax
import org.apache.spark.sql.functions.{mean, stddev}

// Mean and (sample) standard deviation of the "view" column
val (mean_view, std_view) = viewdd.select(mean("view"), stddev("view"))
  .as[(Double, Double)]
  .first()

// Standardize: subtract the mean, divide by the standard deviation
viewdd.withColumn("view_scaled", ($"view" - mean_view) / std_view)
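The arithmetic itself is easy to sanity-check without Spark. The toy values below are made up for illustration; the stddev computed here uses the sample (n-1) denominator, matching Spark's `stddev` (an alias of `stddev_samp`):

```scala
// Hypothetical stand-in for the "view" column values
val views = Seq(1.0f, 2.0f, 3.0f, 4.0f).map(_.toDouble)

// Same statistics as select(mean("view"), stddev("view"))
val meanView = views.sum / views.size
val stdView = math.sqrt(
  views.map(v => math.pow(v - meanView, 2)).sum / (views.size - 1))

// Same transformation as the withColumn expression above
val scaled = views.map(v => (v - meanView) / stdView)
```

After scaling, the column has mean 0 and unit standard deviation, which is the usual input ALS-style models expect.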