Return the unique element column value from a file using Spark (Scala/Python)
Consider a file with the two columns shown below, where the products column holds the set of products carried by each store. I need to return the product that is available in only one store, along with that store's name. I tried the approach below, but I'm looking for a more efficient solution. Thanks in advance.
store products
walmart eggs,cereals,milk
target toys,eggs,cereals
costco eggs,cereals,milk
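For reference, this sample can be reproduced as an in-memory DataFrame (a sketch, assuming a SparkSession named spark; in practice dataDF would come from reading the file):

import spark.implicits._

// In-memory stand-in for the file-backed dataDF used below
val dataDF = Seq(
  ("walmart", "eggs,cereals,milk"),
  ("target",  "toys,eggs,cereals"),
  ("costco",  "eggs,cereals,milk")
).toDF("store", "products")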
val df1 = dataDF.select("products").agg(collect_list("products")).collect.toArray
df1(0).getSeq[String](0).toList.flatMap(_.split(","))
  .groupBy(identity).mapValues(_.length)
  .filter(_._2 == 1).keys.head
This returns toys; I then filter the corresponding store from the DataFrame, but that doesn't seem efficient.
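For context, the follow-up filter step described above might look like the sketch below; uniqueProd is a hypothetical name for the value returned by the driver-side code:

import org.apache.spark.sql.functions.{array_contains, split}
import spark.implicits._ // for the $"..." column syntax

val uniqueProd = "toys" // assumed: the product found by the code above
dataDF
  .filter(array_contains(split($"products", ","), uniqueProd))
  .select("store")
  .show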
Expected output
target toys
You can try this:
import org.apache.spark.sql.functions._

dataDF
  .withColumn("products", split($"products", ","))  // Parse as array
  .withColumn("product", explode($"products"))      // Explode into rows
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))         // Collect the stores selling each product
  .filter(size($"stores") === 1)                    // Keep products sold by exactly one store
  .show
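The query above lists each one-store product together with a single-element stores array. To match the expected output exactly (store name plus product), you could unwrap that array; a sketch, assuming Spark 2.4+ for element_at:

import org.apache.spark.sql.functions._
import spark.implicits._

dataDF
  .withColumn("product", explode(split($"products", ",")))
  .groupBy($"product")
  .agg(collect_list($"store").as("stores"))
  .filter(size($"stores") === 1)
  .select(element_at($"stores", 1).as("store"), $"product") // unwrap the single store
  .show
// With the sample data this prints: target, toys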