Count unique values in a row

Question

Test data:

df = spark.createDataFrame([(1, 1), (2, 3), (3, 3)], ['c1', 'c2'])
df.show()
#+---+---+
#| c1| c2|
#+---+---+
#|  1|  1|
#|  2|  3|
#|  3|  3|
#+---+---+

I want to count the distinct values in each row and store that count in a separate column. How can I do this?

Desired result:

#+---+---+---+
#| c1| c2| c3|
#+---+---+---+
#|  1|  1|  1|
#|  2|  3|  2|
#|  3|  3|  1|
#+---+---+---+

Answer 1

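Put the columns into an array, deduplicate it with array_distinct (available since Spark 2.4), and take the size of the result: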

import pyspark.sql.functions as F

df.withColumn('c3', F.size(F.array_distinct(F.array(*df.columns)))).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  1|  1|  1|
|  2|  3|  2|
|  3|  3|  1|
+---+---+---+
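
As a rough cross-check of what that expression computes for each row, here is a plain-Python sketch that does not use Spark at all. The helper name count_distinct_per_row is hypothetical and only serves the illustration; it mirrors the array / array_distinct / size chain with a tuple, a set, and len.

```python
# Plain-Python model of F.size(F.array_distinct(F.array(c1, c2))); no Spark required.
# count_distinct_per_row is a hypothetical helper name used only for illustration.

def count_distinct_per_row(rows):
    """Return each row extended with the number of distinct values it holds."""
    result = []
    for row in rows:
        distinct_count = len(set(row))  # set() drops duplicates within the row
        result.append(tuple(row) + (distinct_count,))
    return result


rows = [(1, 1), (2, 3), (3, 3)]  # same test data as in the question
for row in count_distinct_per_row(rows):
    print(row)
```

The third element of each printed tuple should match the c3 column above. This only models the logic on small local data; on a real DataFrame the built-in column functions are the way to go, since they run inside Spark without moving rows into Python.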
