How to take out unique values from a column and create another column with some condition after grouping in pyspark

I have a table like this (table A):

What I want (table B):

When we do groupby(r, z) we get the combinations above, but how do we split the v column from table A into a v_num column? v_num should hold the values from table A other than 99. If a 99 appears in a group we should still count it, but it must be kept out of v_num. And if a group contains combinations like 1 and 2, we should keep those as separate rows.

Please help me, thanks in advance!!
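To make the intended semantics concrete (the original tables were posted as images and are not reproduced here), here is a plain-Python sketch with hypothetical sample rows: group by (r, z), keep every non-99 value in v_num, but let count include the 99s as well.

```python
from collections import defaultdict

# Hypothetical sample rows (r, z, v); illustrative values only,
# since the actual tables A and B are not available in the post.
rows = [
    ("r1", "z1", 5),
    ("r1", "z1", 99),
    ("r1", "z1", 5),
    ("r2", "z2", 7),
    ("r2", "z2", 99),
]

# Group by (r, z) and collect all v values per group.
groups = defaultdict(list)
for r, z, v in rows:
    groups[(r, z)].append(v)

# v_num keeps only the non-99 values; count covers every row, 99 included.
result = {
    key: {"v_num": [x for x in vs if x != 99], "count": len(vs)}
    for key, vs in groups.items()
}
```

With the sample above, `result[("r1", "z1")]` holds `v_num = [5, 5]` and `count = 3`: the 99 is counted but excluded from v_num.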

Spark >= 2.4:

spark.sql(
      """
        |select r, z, FILTER(v, x -> x != 99) as v_num, size(v) as count
        |FROM
        |(select r, z, collect_list(v) as v
        |from table
        |group by r, z) a
      """.stripMargin)
      .show()
// If you want to take only the first element as v_num, change the query as below:

spark.sql(
      """
        |select r, z, FILTER(v, x -> x != 99)[0] as v_num, size(v) as count
        |FROM
        |(select r, z, collect_list(v) as v
        |from table
        |group by r, z) a
      """.stripMargin)
      .show()
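The `[0]` in the second query simply indexes into the array produced by `FILTER`, picking the first surviving element, while `size(v)` still counts every collected value. In plain-Python terms (illustrative values only):

```python
# One group's collected v values, as collect_list(v) would produce them.
vs = [5, 99, 5]

# Mirrors FILTER(v, x -> x != 99)[0]: drop the 99s, take the first element.
v_num = [x for x in vs if x != 99][0]

# Mirrors size(v): the count includes the 99s.
count = len(vs)
```

Here `v_num` is 5 and `count` is 3, matching what the SQL variant returns per group.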

The same solution as @someshwar mentioned, but using the pyspark DataFrame API:

from pyspark.sql.functions import collect_list, expr, size

df = df.groupBy('r', 'z').agg(collect_list('v').alias('v')) \
       .select('r', 'z',
               expr('filter(v, x -> x != 99)').alias('v_num'),
               size('v').alias('count'))