How to check if a value in a column is found in a list in a column, with Spark SQL?

Question

I have a delta table A, as shown below.

point	cluster	points_in_cluster
37	1	[37,32]
45	2	[45,67,84]
67	2	[45,67,84]
84	2	[45,67,84]
32	1	[37,32]

I also have a table B, as shown below.

id	point
101	37
102	67
103	84

I want a query like the one below. The in operator obviously does not work against a list column here, so what is the correct syntax?

select b.id, a.point
from A a, B b
where b.point in a.points_in_cluster

The result should be a table like this:

id	point
101	37
101	32
102	45
102	67
102	84
103	45
103	67
103	84

Answer 1

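Based on your data sample, I would do an equi-join on the point column and then an explode on points_in_cluster. This works for the sample because every point in A appears in its own points_in_cluster list, so matching on point already selects the right cluster: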

from pyspark.sql import functions as F

# assuming A is df_A and B is df_B

df_A.join(
    df_B,
    on="point"
).select(
    "id",
    F.explode("points_in_cluster").alias("point")
)

Otherwise, you can use array_contains:

select b.id, a.point
from A a, B b
where array_contains(a.points_in_cluster, b.point)

apache-spark

apache-spark-sql

pyspark