String matching function between two columns using Levenshtein distance in PySpark
I'm trying to compare pairs of names by converting the edit distance between them into a match coefficient, e.g.:
coef = 1 - levenshtein(str1, str2) / max(length(str1), length(str2))
However, when I implement this with withColumn() in PySpark, I get an error when evaluating the max() function. Both numpy.max and pyspark.sql.functions.max throw errors. Any ideas?
from pyspark.sql.functions import col, length, levenshtein
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])
test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / max(length(col('firstname')), length(col('firstname2'))))
max is an aggregate function: it finds the maximum value over a group of rows. What you want is greatest, also from pyspark.sql.functions, which returns the largest value among several columns within a single row.
from pyspark.sql.functions import col, length, greatest, levenshtein
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
TableA = spark.createDataFrame(valuesA,['firstname','id'])
test_compare = TableA.withColumnRenamed('firstname', 'firstname2').withColumnRenamed('id', 'id2').crossJoin(TableA)
test_compare.withColumn("distance_firstname", levenshtein('firstname', 'firstname2') / greatest(length(col('firstname')), length(col('firstname2')))).show()
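Note that the expression above computes only the ratio distance / greatest(length, length); to get the match coefficient from the question you would subtract it from 1. The formula can be sketched in plain Python (no Spark needed) to see exactly what the column expression computes; edit_distance here is a standard dynamic-programming Levenshtein implementation written for illustration, not a PySpark function:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic row-by-row DP Levenshtein: prev holds the previous
    # row of the edit-distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def match_coef(s1: str, s2: str) -> float:
    # coef = 1 - levenshtein / max(len): 1.0 for identical strings,
    # 0.0 when every character differs.
    return 1 - edit_distance(s1, s2) / max(len(s1), len(s2))
```

In Spark the same coefficient would be written as `1 - levenshtein(...) / greatest(length(...), length(...))` inside `withColumn`.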