Return the unique element column value from a file using Spark (Scala/Python)
Consider a file with the two columns shown below, where the products column holds the set of products carried by each store. I need to return the product that is available in only one store, along with that store's name. I tried the approach below, but I'm looking for a more efficient solution. Thanks in advance.
store products
walmart eggs,cereals,milk
target toys,eggs,cereals
costco eggs,cereals,milk
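For reference, this sample can be reproduced as an in-memory DataFrame (a sketch, assuming a SparkSession named spark; in practice dataDF would come from reading the file):

import spark.implicits._

// In-memory stand-in for the file-backed dataDF used below
val dataDF = Seq(
  ("walmart", "eggs,cereals,milk"),
  ("target",  "toys,eggs,cereals"),
  ("costco",  "eggs,cereals,milk")
).toDF("store", "products")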
val df1 = dataDF.select("products").agg(collect_list("products")).collect.toArray
df1(0).getSeq[String](0).toList.flatMap(_.split(","))
  .groupBy(identity).mapValues(_.length)
  .filter(_._2 == 1).keys.head
This returns toys; I then filter the corresponding store from the DataFrame, but that doesn't seem efficient.
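For context, the follow-up filter step described above might look like the sketch below; uniqueProd is a hypothetical name for the value returned by the driver-side code:

import org.apache.spark.sql.functions.{array_contains, split}
import spark.implicits._ // for the $"..." column syntax

val uniqueProd = "toys" // assumed: the product found by the code above
dataDF
  .filter(array_contains(split($"products", ","), uniqueProd))
  .select("store")
  .show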
Expected output
target toys
You can try this:
import org.apache.spark.sql.functions._

dataDF
  .withColumn("products", split($"products", ","))  // Parse as array
  .withColumn("product", explode($"products"))      // Explode into rows
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))         // Collect the stores selling each product
  .filter(size($"stores") === 1)                    // Keep products sold by exactly one store
  .show
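The query above lists each one-store product together with a single-element stores array. To match the expected output exactly (store name plus product), you could unwrap that array; a sketch, assuming Spark 2.4+ for element_at:

import org.apache.spark.sql.functions._
import spark.implicits._

dataDF
  .withColumn("product", explode(split($"products", ",")))
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))
  .filter(size($"stores") === 1)
  .select(element_at($"stores", 1).as("store"), $"product") // unwrap the single store
  .show
// With the sample data this prints: target, toys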