Jaro-Winkler score calculation in Apache Spark
We need to compute the Jaro-Winkler distance between strings in an Apache Spark Dataset. We are new to Spark, and after searching online we could not find much; any guidance would be greatly appreciated. We thought of using flatMap, then realized it would not help; we then tried a couple of foreach loops but could not figure out how to proceed, since every string has to be compared against every other string. For example, given the dataset below:
List<Row> data = Arrays.asList(
    RowFactory.create(0, "Hi I heard about Spark"),
    RowFactory.create(1, "I wish Java could use case classes"),
    RowFactory.create(2, "Logistic,regression,models,are,neat"));
An example of the Jaro-Winkler scores for all the strings found in the DataFrame above:
Distance score between label 0,1 -> 0.56
Distance score between label 0,2 -> 0.77
Distance score between label 0,3 -> 0.45
Distance score between label 1,2 -> 0.77
Distance score between label 2,3 -> 0.79
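To make the requirement concrete, the pairing being asked for ("each string compared against all others") can be sketched outside Spark in plain Python; the labels and sentences are the ones above, and only the pair enumeration is shown, not the similarity scores:

```python
from itertools import combinations

rows = [
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
]

# Every row must be paired exactly once with every other row,
# giving n*(n-1)/2 unordered pairs.
pairs = list(combinations(rows, 2))
for (label1, _), (label2, _) in pairs:
    print(f"Distance score between label {label1},{label2}")
```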
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import info.debatty.java.stringsimilarity.JaroWinkler;

public class JaroTestExample {

    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "C:\\winutil");
        SparkSession spark = SparkSession.builder()
                .appName("JavaTokenizerExample")
                .master("local[*]")
                .getOrCreate();

        JaroWinkler jw = new JaroWinkler();
        // substitution of s and t
        System.out.println(jw.similarity("My string", "My tsring"));
        // substitution of s and n
        System.out.println(jw.similarity("My string", "My ntrisg"));

        List<Row> data = Arrays.asList(
                RowFactory.create(0, "Hi I heard about Spark"),
                RowFactory.create(1, "I wish Java could use case classes"),
                RowFactory.create(2, "Logistic,regression,models,are,neat"));
        StructType schema = new StructType(new StructField[] {
                new StructField("label", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty()) });
        Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

        // This is where we are stuck: how do we compare every row
        // against every other row?
        sentenceDataFrame.foreach((ForeachFunction<Row>) row -> System.out.println(row));
    }
}
You can do a cross join in Spark with the following code:

Dataset2Object = Dataset1Object.crossJoin(Dataset2Object);

Dataset2Object then contains every combination of record pairs you need. flatMap will not help in this case. Remember to use spark-sql_2.11, version 2.1.0.
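To see why the cross join gives you all the pairs (and why it normally needs filtering), here is a minimal sketch of the same idea in plain Python. The `label1 < label2` filter drops self-pairs and mirrored duplicates; the same filter would be applied to the joined Dataset:

```python
from itertools import product

rows = [
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat"),
]

# A cross join of the dataset with itself yields all n*n combinations,
# including self-pairs like (0, 0) and mirrored duplicates like
# (0, 1) and (1, 0).
cross = list(product(rows, rows))

# Keep each unordered pair exactly once by requiring label1 < label2.
unique_pairs = [(left, right) for left, right in cross if left[0] < right[0]]
```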
Scala
You can use the spark-stringmetric library as follows:
import com.github.mrpowers.spark.stringmetric.SimilarityFunctions
import org.apache.spark.sql.functions.col

df.withColumn(
  "w1_w2_jaro_winkler",
  SimilarityFunctions.jaro_winkler(col("word1"), col("word2"))
)
PySpark
You can use the ceja library as follows:
import ceja
from pyspark.sql.functions import col

df.withColumn("jaro_winkler_similarity", ceja.jaro_winkler_similarity(col("word1"), col("word2")))
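If you want to sanity-check the scores any of these libraries return, the Jaro-Winkler formula itself is short enough to sketch in plain Python. This uses the standard parameters (prefix weight 0.1, common prefix capped at 4 characters) and is an illustrative reference implementation, not a replacement for the libraries above:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching if they are within this distance.
    match_dist = max(len1, len2) // 2 - 1
    s1_matched = [False] * len1
    s2_matched = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        start = max(0, i - match_dist)
        end = min(len2, i + match_dist + 1)
        for j in range(start, end):
            if not s2_matched[j] and s2[j] == c:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that appear in a different order.
    transpositions = 0
    k = 0
    for i in range(len1):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return j + prefix * p * (1 - j)
```

For instance, jaro_winkler("MARTHA", "MARHTA") comes out to about 0.961, the textbook value for that pair.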