Spark Scala - 如何对数据帧行进行分组并将复杂函数应用于组?
Spark Scala - How to group dataframe rows and apply complex function to the groups?
我正在尝试解决这个超级简单的问题,但我已经厌倦了它,我希望有人能帮助我解决这个问题。我有一个像这样的数据框:
--------------------------
|类别 | Product_ID |
|------------+------------+
|一个 |产品 1 |
|一个 |产品 2 |
|一个 |产品 3 |
|一个 |产品 1 |
|一个 |产品 4 |
|乙 |产品 5 |
|乙 |产品 6 |
--------------------------
如何按类别对这些行进行分组并在 Scala 中应用复杂的函数?
也许是这样的:
val result = df.groupBy("Category").apply(myComplexFunction)
这个 myComplexFunction 应该为每个类别生成以下 table 并将成对相似性上传到 Hive table 或将其保存到 HDFS 中:
+------------------------------------------------ -+
| | Product_1 | Product_2 | Product_3 |
+------------+------------+-------------------- -+
| Product_1 | 1.0 | 0.1 | 0.8 |
| Product_2 | 0.1 | 1.0 | 0.5 |
| Product_3 | 0.8 | 0.5 | 1.0 |
+------------------------------------------------ -+
这是我要应用的函数(它只是计算每个类别中的项目余弦相似度):
def myComplexFunction(context_data : DataFrame, country_name: String,
context_id: String, table_name_correlations: String,
context_layer: String, context_index: String) : Boolean = {
val unique_identifier = country_name + "_" + context_layer + "_" + context_index
val temp_table_vocabulary = "temp_vocabulary_" + unique_identifier
val temp_table_similarities = "temp_similarities_" + unique_identifier
val temp_table_correlations = "temp_correlations_" + unique_identifier
//context.count()
// fit a CountVectorizerModel from the corpus
//println("Creating sparse incidence matrix")
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").fit(context_data)
val incidence = cvModel.transform(context_data)
// ========================================================================================
// create dataframe of mapping from indices into the item id
//println("Creating vocabulary")
val vocabulary_rdd = sc.parallelize(cvModel.vocabulary)
val rows_vocabulary_rdd = vocabulary_rdd.zipWithIndex.map{ case (s,i) => Row(s,i)}
val vocabulary_field1 = StructField("Product_ID", StringType, true)
val vocabulary_field2 = StructField("Product_Index", LongType, true)
val schema_vocabulary = StructType(Seq(vocabulary_field1, vocabulary_field2))
val df_vocabulary = hiveContext.createDataFrame(rows_vocabulary_rdd, schema_vocabulary)
// ========================================================================================
//println("Computing similarity matrix")
val myvectors = incidence.select("features").rdd.map(r => r(0).asInstanceOf[Vector])
val mat: RowMatrix = new RowMatrix(myvectors)
val sims = mat.columnSimilarities(0.0)
// ========================================================================================
// Convert records of the Matrix Entry RDD into Rows
//println("Extracting paired similarities")
val rowRdd = sims.entries.map{case MatrixEntry(i, j, v) => Row(i, j, v)}
// ========================================================================================
// create dataframe schema
//println("Creating similarity dataframe")
val field1 = StructField("Product_Index", LongType, true)
val field2 = StructField("Neighbor_Index", LongType, true)
var field3 = StructField("Similarity_Score", DoubleType, true)
val schema_similarities = StructType(Seq(field1, field2, field3))
// create the dataframe
val df_similarities = hiveContext.createDataFrame(rowRdd, schema_similarities)
// ========================================================================================
//println("Register vocabulary and correlations as spark temp tables")
df_vocabulary.registerTempTable(temp_table_vocabulary)
df_similarities.registerTempTable(temp_table_similarities)
// ========================================================================================
//println("Extracting Product_ID")
val temp_corrs = hiveContext.sql(
s"SELECT T1.Product_ID, T2.Neighbor_ID, T1.Similarity_Score " +
s"FROM " +
s"(SELECT Product_ID, Neighbor_Index, Similarity_Score " +
s"FROM $temp_table_similarities LEFT JOIN $temp_table_vocabulary " +
s"WHERE $temp_table_similarities.Product_Index = $temp_table_vocabulary.Product_Index) AS T1 " +
s"LEFT JOIN " +
s"(SELECT Product_ID AS Neighbor_ID, Product_Index as Neighbor_Index FROM $temp_table_vocabulary) AS T2 " +
s"ON " +
s"T1.Neighbor_Index = T2.Neighbor_Index")
// ========================================================================================
val context_corrs = temp_corrs.withColumn("Context_Layer", lit(context_layer)).withColumn("Context_ID", lit(context_id)).withColumn("Country", lit(country_name))
context_corrs.registerTempTable(temp_table_correlations)
// ========================================================================================
hiveContext.sql(s"INSERT INTO TABLE $table_name_correlations SELECT * FROM $temp_table_correlations")
// ========================================================================================
// clean up environment
//println("Cleaning up temp tables")
hiveContext.dropTempTable(temp_table_correlations)
hiveContext.dropTempTable(temp_table_similarities)
hiveContext.dropTempTable(temp_table_vocabulary)
return true
}
val partitioned = tokenized.repartition(tokenized("context_id"))
val context_counts = partitioned.mapPartitions()
//val context_counts = model_code_ids.zipWithIndex.map{case (model_code_id, context_index) => compute_similarity(tokenized.filter(tokenized("context_id") === model_code_id), country_name, model_code_id.asInstanceOf[String], table_name_correlations, context_layer, context_index.toString)}
}
我已经尝试过以下操作:
val category_ids = df.select("Category").distinct.collect()
val result = category_ids.map(category_id => myComplexFunction(df.filter(df("Category") <=> category_id)))
我不知道为什么,但这种方法是按顺序运行的,而不是并行运行的。
余弦相似度不是一个复杂的函数,可以使用标准 SQL 聚合表示。让我们考虑以下示例:
val df = Seq(
("feat1", 1.0, "item1"),
("feat2", 1.0, "item1"),
("feat6", 1.0, "item1"),
("feat1", 1.0, "item2"),
("feat3", 1.0, "item2"),
("feat4", 1.0, "item3"),
("feat5", 1.0, "item3"),
("feat1", 1.0, "item4"),
("feat6", 1.0, "item4")
).toDF("feature", "value", "item")
其中feature
是特征标识符,value
是对应值,item
是对象标识符,feature
、item
对只有一个相应的值。
余弦相似度定义为:
其中分子可以计算为:
val numer = df.as("this").withColumnRenamed("item", "this")
.join(df.as("other").withColumnRenamed("item", "other"), Seq("feature"))
.where($"this" < $"other")
.groupBy($"this", $"other")
.agg(sum($"this.value" * $"other.value").alias("dot"))
和分母中使用的范数为:
import org.apache.spark.sql.functions.sqrt
val norms = df.groupBy($"item").agg(sqrt(sum($"value" * $"value")).alias("norm"))
// 组合在一起:
val cosine = ($"dot" / ($"this_norm.norm" * $"other_norm.norm")).as("cosine")
val similarities = numer
.join(norms.alias("this_norm").withColumnRenamed("item", "this"), Seq("this"))
.join(norms.alias("other_norm").withColumnRenamed("item", "other"), Seq("other"))
.select($"this", $"other", cosine)
结果表示上三角矩阵的非零项忽略对角线(这是微不足道的):
+-----+-----+-------------------+
| this|other| cosine|
+-----+-----+-------------------+
|item1|item4| 0.8164965809277259|
|item1|item2|0.40824829046386296|
|item2|item4| 0.4999999999999999|
+-----+-----+-------------------+
这应该等同于:
import org.apache.spark.sql.functions.array
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.Vectors
val pivoted = df.groupBy("item").pivot("feature").sum()
.na.fill(0.0)
.orderBy("item")
val mat = new IndexedRowMatrix(pivoted
.select(array(pivoted.columns.tail.map(col): _*))
.rdd
.zipWithIndex
.map {
case (row, idx) =>
new IndexedRow(idx, Vectors.dense(row.getSeq[Double](0).toArray))
})
mat.toCoordinateMatrix.transpose
.toIndexedRowMatrix.columnSimilarities
.toBlockMatrix.toLocalMatrix
0.0 0.408248290463863 0.0 0.816496580927726
0.0 0.0 0.0 0.4999999999999999
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
关于您的代码:
- 执行是顺序的,因为您的代码在本地 (
collected
) 集合上运行。
myComplexFunction
无法进一步分布式,因为它是分布式数据结构和上下文。
我正在尝试解决这个超级简单的问题,但我已经厌倦了它,我希望有人能帮助我解决这个问题。我有一个像这样的数据框:
-------------------------- |类别 | Product_ID | |------------+------------+ |一个 |产品 1 | |一个 |产品 2 | |一个 |产品 3 | |一个 |产品 1 | |一个 |产品 4 | |乙 |产品 5 | |乙 |产品 6 | --------------------------
如何按类别对这些行进行分组并在 Scala 中应用复杂的函数? 也许是这样的:
val result = df.groupBy("Category").apply(myComplexFunction)
这个 myComplexFunction 应该为每个类别生成以下 table 并将成对相似性上传到 Hive table 或将其保存到 HDFS 中:
+------------------------------------------------ -+ | | Product_1 | Product_2 | Product_3 | +------------+------------+-------------------- -+ | Product_1 | 1.0 | 0.1 | 0.8 | | Product_2 | 0.1 | 1.0 | 0.5 | | Product_3 | 0.8 | 0.5 | 1.0 | +------------------------------------------------ -+
这是我要应用的函数(它只是计算每个类别中的项目余弦相似度):
def myComplexFunction(context_data : DataFrame, country_name: String,
context_id: String, table_name_correlations: String,
context_layer: String, context_index: String) : Boolean = {
val unique_identifier = country_name + "_" + context_layer + "_" + context_index
val temp_table_vocabulary = "temp_vocabulary_" + unique_identifier
val temp_table_similarities = "temp_similarities_" + unique_identifier
val temp_table_correlations = "temp_correlations_" + unique_identifier
//context.count()
// fit a CountVectorizerModel from the corpus
//println("Creating sparse incidence matrix")
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("words").setOutputCol("features").fit(context_data)
val incidence = cvModel.transform(context_data)
// ========================================================================================
// create dataframe of mapping from indices into the item id
//println("Creating vocabulary")
val vocabulary_rdd = sc.parallelize(cvModel.vocabulary)
val rows_vocabulary_rdd = vocabulary_rdd.zipWithIndex.map{ case (s,i) => Row(s,i)}
val vocabulary_field1 = StructField("Product_ID", StringType, true)
val vocabulary_field2 = StructField("Product_Index", LongType, true)
val schema_vocabulary = StructType(Seq(vocabulary_field1, vocabulary_field2))
val df_vocabulary = hiveContext.createDataFrame(rows_vocabulary_rdd, schema_vocabulary)
// ========================================================================================
//println("Computing similarity matrix")
val myvectors = incidence.select("features").rdd.map(r => r(0).asInstanceOf[Vector])
val mat: RowMatrix = new RowMatrix(myvectors)
val sims = mat.columnSimilarities(0.0)
// ========================================================================================
// Convert records of the Matrix Entry RDD into Rows
//println("Extracting paired similarities")
val rowRdd = sims.entries.map{case MatrixEntry(i, j, v) => Row(i, j, v)}
// ========================================================================================
// create dataframe schema
//println("Creating similarity dataframe")
val field1 = StructField("Product_Index", LongType, true)
val field2 = StructField("Neighbor_Index", LongType, true)
var field3 = StructField("Similarity_Score", DoubleType, true)
val schema_similarities = StructType(Seq(field1, field2, field3))
// create the dataframe
val df_similarities = hiveContext.createDataFrame(rowRdd, schema_similarities)
// ========================================================================================
//println("Register vocabulary and correlations as spark temp tables")
df_vocabulary.registerTempTable(temp_table_vocabulary)
df_similarities.registerTempTable(temp_table_similarities)
// ========================================================================================
//println("Extracting Product_ID")
val temp_corrs = hiveContext.sql(
s"SELECT T1.Product_ID, T2.Neighbor_ID, T1.Similarity_Score " +
s"FROM " +
s"(SELECT Product_ID, Neighbor_Index, Similarity_Score " +
s"FROM $temp_table_similarities LEFT JOIN $temp_table_vocabulary " +
s"WHERE $temp_table_similarities.Product_Index = $temp_table_vocabulary.Product_Index) AS T1 " +
s"LEFT JOIN " +
s"(SELECT Product_ID AS Neighbor_ID, Product_Index as Neighbor_Index FROM $temp_table_vocabulary) AS T2 " +
s"ON " +
s"T1.Neighbor_Index = T2.Neighbor_Index")
// ========================================================================================
val context_corrs = temp_corrs.withColumn("Context_Layer", lit(context_layer)).withColumn("Context_ID", lit(context_id)).withColumn("Country", lit(country_name))
context_corrs.registerTempTable(temp_table_correlations)
// ========================================================================================
hiveContext.sql(s"INSERT INTO TABLE $table_name_correlations SELECT * FROM $temp_table_correlations")
// ========================================================================================
// clean up environment
//println("Cleaning up temp tables")
hiveContext.dropTempTable(temp_table_correlations)
hiveContext.dropTempTable(temp_table_similarities)
hiveContext.dropTempTable(temp_table_vocabulary)
return true
}
val partitioned = tokenized.repartition(tokenized("context_id"))
val context_counts = partitioned.mapPartitions()
//val context_counts = model_code_ids.zipWithIndex.map{case (model_code_id, context_index) => compute_similarity(tokenized.filter(tokenized("context_id") === model_code_id), country_name, model_code_id.asInstanceOf[String], table_name_correlations, context_layer, context_index.toString)}
}
我已经尝试过以下操作:
val category_ids = df.select("Category").distinct.collect()
val result = category_ids.map(category_id => myComplexFunction(df.filter(df("Category") <=> category_id)))
我不知道为什么,但这种方法是按顺序运行的,而不是并行运行的。
余弦相似度不是一个复杂的函数,可以使用标准 SQL 聚合表示。让我们考虑以下示例:
val df = Seq(
("feat1", 1.0, "item1"),
("feat2", 1.0, "item1"),
("feat6", 1.0, "item1"),
("feat1", 1.0, "item2"),
("feat3", 1.0, "item2"),
("feat4", 1.0, "item3"),
("feat5", 1.0, "item3"),
("feat1", 1.0, "item4"),
("feat6", 1.0, "item4")
).toDF("feature", "value", "item")
其中feature
是特征标识符,value
是对应值,item
是对象标识符,feature
、item
对只有一个相应的值。
余弦相似度定义为:
其中分子可以计算为:
val numer = df.as("this").withColumnRenamed("item", "this")
.join(df.as("other").withColumnRenamed("item", "other"), Seq("feature"))
.where($"this" < $"other")
.groupBy($"this", $"other")
.agg(sum($"this.value" * $"other.value").alias("dot"))
和分母中使用的范数为:
import org.apache.spark.sql.functions.sqrt
val norms = df.groupBy($"item").agg(sqrt(sum($"value" * $"value")).alias("norm"))
// 组合在一起:
val cosine = ($"dot" / ($"this_norm.norm" * $"other_norm.norm")).as("cosine")
val similarities = numer
.join(norms.alias("this_norm").withColumnRenamed("item", "this"), Seq("this"))
.join(norms.alias("other_norm").withColumnRenamed("item", "other"), Seq("other"))
.select($"this", $"other", cosine)
结果表示上三角矩阵的非零项忽略对角线(这是微不足道的):
+-----+-----+-------------------+
| this|other| cosine|
+-----+-----+-------------------+
|item1|item4| 0.8164965809277259|
|item1|item2|0.40824829046386296|
|item2|item4| 0.4999999999999999|
+-----+-----+-------------------+
这应该等同于:
import org.apache.spark.sql.functions.array
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.Vectors
val pivoted = df.groupBy("item").pivot("feature").sum()
.na.fill(0.0)
.orderBy("item")
val mat = new IndexedRowMatrix(pivoted
.select(array(pivoted.columns.tail.map(col): _*))
.rdd
.zipWithIndex
.map {
case (row, idx) =>
new IndexedRow(idx, Vectors.dense(row.getSeq[Double](0).toArray))
})
mat.toCoordinateMatrix.transpose
.toIndexedRowMatrix.columnSimilarities
.toBlockMatrix.toLocalMatrix
0.0 0.408248290463863 0.0 0.816496580927726
0.0 0.0 0.0 0.4999999999999999
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
关于您的代码:
- 执行是顺序的,因为您的代码在本地 (
collected
) 集合上运行。 myComplexFunction
无法进一步分布式,因为它是分布式数据结构和上下文。