How to drop duplicates from a cartesian product in spark
I created a cross join of a set of words to compare their similarity in Spark. However, I'm trying to get rid of the duplicate entries, since (word1, word2) and (word2, word1) have the same score. I have the following table:
+-------+-------+-------+
| col1 | col2 | score |
+-------+-------+-------+
| word1 | word1 | 1 |
| word1 | word2 | 0.345 |
| word1 | word3 | 0.432 |
| word2 | word1 | 0.345 |
| word2 | word2 | 1 |
| word2 | word3 | 0.543 |
| word3 | word1 | 0.432 |
| word3 | word2 | 0.543 |
| word3 | word3 | 1 |
+-------+-------+-------+
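For context, a table like this can be produced with DataFrame.crossJoin. Here is a minimal sketch; similarity below is a hypothetical stand-in for whatever scoring function actually produced the scores:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

words = spark.createDataFrame([("word1",), ("word2",), ("word3",)], ["word"])

# Hypothetical similarity UDF; replace with your real scoring function.
similarity = F.udf(lambda a, b: 1.0 if a == b else 0.5, "double")

# Pair every word with every word, then score each pair.
df = (words.withColumnRenamed("word", "col1")
      .crossJoin(words.withColumnRenamed("word", "col2"))
      .withColumn("score", similarity(F.col("col1"), F.col("col2"))))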
Ideally, I'd like to end up with a result like this, where no comparison is repeated:
+-------+-------+-------+
| col1 | col2 | score |
+-------+-------+-------+
| word1 | word1 | 1 |
| word1 | word2 | 0.345 |
| word1 | word3 | 0.432 |
| word2 | word2 | 1 |
| word2 | word3 | 0.543 |
| word3 | word3 | 1 |
+-------+-------+-------+
Combine col1 and col2 into a single list and sort it alphabetically with sort_array. Once sorted, calling .distinct() removes the duplicates. You can then unpack the list back into col1 and col2:
from pyspark.sql import functions as F

# Pack both words into an alphabetically sorted array so that
# (word1, word2) and (word2, word1) map to the same key, drop
# duplicates, then unpack the array back into two columns.
df.withColumn("sorted_list", F.sort_array(F.array(F.col("col1"), F.col("col2"))))\
  .select("sorted_list", "score").distinct()\
  .select(F.col("sorted_list")[0].alias("col1"),
          F.col("sorted_list")[1].alias("col2"), "score").show()
Output:
+-----+-----+-----+
| col1| col2|score|
+-----+-----+-----+
|word1|word1| 1.0|
|word1|word2|0.345|
|word1|word3|0.432|
|word2|word2| 1.0|
|word2|word3|0.543|
|word3|word3| 1.0|
+-----+-----+-----+
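A side note, not part of the original answer: .distinct() only collapses mirrored pairs whose scores are exactly equal. Since sorting the pair alphabetically keeps the smaller word in col1, the surviving rows are exactly those where col1 <= col2, so an equivalent dedup skips the array entirely:

# Keep one row per unordered pair: the one whose words are already
# in alphabetical order. Mirrored rows like (word2, word1) are dropped.
df.filter(F.col("col1") <= F.col("col2")).show()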