Spark - check intersect of two string columns

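I have the dataframe below, where colA and colB contain strings. I am trying to check whether colB contains any substring of the values in colA. The values can contain a comma or a space, but as long as any part of colB's string overlaps with colA's string, it counts as a match. For example, row 1 below has an overlap ("bc"), while row 2 does not.

I was thinking of splitting the values into arrays, but the delimiter is not constant. Could someone shed some light on how to do this? Many thanks for your help.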

   +---+-------+-----------+
   | id|colA   | colB      |
   +---+-------+-----------+
   |  1|abc d  |  bc, z    |
   |  2|abcde  |  hj f     |
   +---+-------+-----------+

You can use a custom UDF to implement the intersect logic, as shown below -

Data preparation

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F

from pyspark.sql.types import StringType

import pandas as pd

# Create the SQLContext used below as `sql`
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

data = {"id" :[1,2],
    "colA" : ["abc d","abcde"],
    "colB" : ["bc, z","hj f"]}
mypd = pd.DataFrame(data)

sparkDF = sql.createDataFrame(mypd)

sparkDF.show()

+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
|  1|abc d|bc, z|
|  2|abcde| hj f|
+---+-----+-----+

UDF

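def str_intersect(x,y):
    # Character-level intersection: set(x) holds the individual characters of x
    # (spaces included). Set ordering is arbitrary, so the order of the joined
    # characters is not guaranteed.
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None

str_intersect_udf = F.udf(str_intersect,StringType())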

sparkDF.withColumn('intersect',str_intersect_udf(F.col('colA'),F.col('colB'))).show()

+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
|  1|abc d|bc, z|      bc |
|  2|abcde| hj f|     null|
+---+-----+-----+---------+
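
Note that `set(x) & set(y)` compares single characters, so it reports a match whenever the two strings share any character (a space included), even if no piece of colB actually occurs inside colA. If the goal is a per-token substring check, a minimal sketch could look like the one below. It is plain Python so it can be tried without a cluster. The name `has_overlap` and the delimiter pattern (commas, whitespace, hyphens) are assumptions for illustration, not something given in the question.

```python
import re

# Assumed delimiters: commas, whitespace and hyphens (adjust to the real data).
_DELIMS = re.compile(r"[,\s-]+")


def has_overlap(col_a, col_b):
    """Return True if any non-empty token of col_b is a substring of col_a."""
    if col_a is None or col_b is None:
        return False
    # Drop empty tokens: "" is a substring of every string.
    tokens = [t for t in _DELIMS.split(col_b) if t]
    return any(t in col_a for t in tokens)


print(has_overlap("abc d", "bc, z"))  # True: "bc" occurs in "abc d"
print(has_overlap("abcde", "hj f"))   # False
print(has_overlap("abc d", "cb"))     # False, although the character sets overlap
```

It could presumably be registered the same way as above, e.g. `F.udf(has_overlap, BooleanType())`, to get a true/false column instead of the joined characters.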

You can split with a regular expression and then create a UDF to check for substrings.

Example:

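from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)
# Split on whitespace (optionally preceded by a comma) or on a hyphen
split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))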


def mapper(a, b):
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result


df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
df.printSchema()
df.show(truncate=False)

Result:

root
 |-- A: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- B: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- id: long (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------+-----------+---+-------+
|A       |B          |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1  |[bc, d]|
|[abc, d]|[acb, abc] |2  |[abc]  |
|[abcde] |[hj, f, ab]|3  |[ab]   |
+--------+-----------+---+-------+
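
One edge case to watch: Spark's `split` can leave empty strings in the array (for example after a trailing or doubled delimiter), and in Python `"" in s` is always true, so `mapper` would report an empty token as a match. A hedged variant that skips empty tokens, tolerates null arrays and avoids duplicate results is sketched below. `safe_mapper` is a hypothetical name used only for illustration.

```python
def safe_mapper(a, b):
    """Return the distinct non-empty elements of b that occur inside some element of a."""
    if not a or not b:
        return []
    result = []
    for ele_b in b:
        # Skip nulls, empty tokens and tokens already reported.
        if not ele_b or ele_b in result:
            continue
        if any(ele_a and ele_b in ele_a for ele_a in a):
            result.append(ele_b)
    return result


print(safe_mapper(["abc", "d"], ["bc", "z", "d"]))  # ['bc', 'd']
print(safe_mapper(["abc"], ["bc", "", "bc"]))       # ['bc']
```

It should be usable as a drop-in replacement for `mapper`, registered with the same `ArrayType(StringType())` return type.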