How to takeout unique values from column and create another column with some condition after grouping in pyspark
I have a table A like this:

What I want, table B:

When we group by (r, z) we get the combinations above, but how do I split column v from table A into a v_num column? v_num should hold the values from table A other than 99; when a group contains 99, it should still be included in the count, but kept out of v_num. If a group combination contains values like 1 and 2, they should go into separate rows.

Please help me, thanks in advance!!
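To make the requirement concrete, here is a minimal plain-Python sketch of the intended transformation, using hypothetical (r, z, v) rows since the actual tables were not included in the question: group by (r, z), keep everything except 99 in v_num, but still count the 99s.

```python
from collections import defaultdict

# Hypothetical (r, z, v) rows standing in for table A.
rows = [('a', 1, 1), ('a', 1, 99), ('b', 2, 2), ('b', 2, 99), ('b', 2, 99)]

# Group by (r, z) and collect the v values per group.
groups = defaultdict(list)
for r, z, v in rows:
    groups[(r, z)].append(v)

# v_num keeps everything except 99; count still includes the 99s.
table_b = [(r, z, [x for x in vs if x != 99], len(vs))
           for (r, z), vs in groups.items()]
print(table_b)  # [('a', 1, [1], 2), ('b', 2, [2], 3)]
```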
spark>=2.4
spark.sql(
"""
|select r, z, FILTER(v, x -> x != 99) as v_num, size(v) as count
|FROM
|(select r, z, collect_list(v) as v
|from table
|group by r, z) a
""".stripMargin)
.show()
// If you want to take the first element as v_num, change the query as below
spark.sql(
"""
|select r, z, FILTER(v, x -> x != 99)[0] as v_num, size(v) as count
|FROM
|(select r, z, collect_list(v) as v
|from table
|group by r, z) a
""".stripMargin)
.show()
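The difference between the two queries can be illustrated in plain Python (hypothetical values): `FILTER` keeps an array of the non-99 values, while indexing with `[0]` takes just the first of them.

```python
# Hypothetical collected array for one (r, z) group.
v = [1, 99]

v_num_array = [x for x in v if x != 99]  # FILTER(v, x -> x != 99)    -> [1]
v_num_first = v_num_array[0]             # FILTER(v, x -> x != 99)[0] -> 1
count = len(v)                           # size(v)                    -> 2
```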
Same as the solution mentioned by @someshwar, but using the pyspark DataFrame API:
from pyspark.sql.functions import collect_list, expr, size

df = (df.groupBy('r', 'z')
        .agg(collect_list('v').alias('v'))
        .select('r', 'z',
                expr("filter(v, x -> x != 99)").alias('v_num'),
                size('v').alias('count')))