
scala merge tuples using fuzzy string match


val input = List(("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"), ("b", "2 cm"), ("b", "2.00 CM"))


val output = List(("a", "10 Inches", 0.66), ("b", "2 cm", 1.0))

I also have a utility function that tells me whether two strings fuzzy-match, for example "10 Inches" and "10.00 inches".

fuzzyMatch(s1, s2) returns

true for s1 = "10 Inches" and s2 = "10.00 inches"
false for s1 = "10 Inches" and s2 = "15 in"
false for s1 = "10.00 inches" and s2 = "15 in"
true for s1 = "2 cm" and s2 = "2.00 CM"

Output = List of (unique_name, most frequent string value, number of occurrences of that value / total occurrences for that name)



val tupleMap = input.groupBy(identity).mapValues(_.size)
val totalOccurrences = input.groupBy(_._1).mapValues(_.size)
val maxNumberOfValueOccurrences = tupleMap.groupBy(_._1._1).mapValues(_.values.max)
val processedInput = tupleMap
      .filter {
        case (k, v) => v == maxNumberOfValueOccurrences(k._1)
      }
      .map {
        case (k, v) => (k._1, k._2, v.toDouble / totalOccurrences(k._1))
      }


What I need is essentially a custom groupBy that uses my fuzzyMatch(...) method, but I can't work out how to do it here.


val tupleMap: Map[String, Seq[String]] = input.groupBy(_._1).mapValues(_.map(_._2))

import scala.collection.mutable

val result = tupleMap mapValues {
  list =>
    // counts per representative value for this key
    val valueCountsMap: mutable.Map[String, Int] = mutable.Map[String, Int]()

    list foreach {
      value =>
        // Using fuzzy match to find the best match
        // findBestMatch (uses fuzzyMatch) returns the Option(key)
        // if there exists a similar key, if not returns None
        val bestMatch = findBestMatch(value, valueCountsMap.keySet.toSeq)
        if (bestMatch.isDefined) {
          val newValueCount = valueCountsMap.getOrElse(bestMatch.get, 0) + 1
          valueCountsMap(bestMatch.get) = newValueCount
        } else {
          valueCountsMap(value) = 1
        }
    }

    // the representative with the highest count, together with that count
    val maxOccurredValueNCount: (String, Int) = valueCountsMap.maxBy(_._2)
    (maxOccurredValueNCount._1, maxOccurredValueNCount._2)
}
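
The snippet above calls findBestMatch without showing it. A minimal sketch of what such a helper could look like follows; the signature is inferred from the call site, and the fuzzyMatch body is only a stand-in with assumed normalization rules, not the asker's real implementation.

```scala
object FindBestMatchSketch {
  // Stand-in for the real fuzzyMatch: the normalization rules here are an assumption.
  def fuzzyMatch(s1: String, s2: String): Boolean = {
    def norm(s: String): String =
      s.toLowerCase.replace(".00", "").replace("inches", "in")
    norm(s1) == norm(s2)
  }

  // Hypothetical helper: the first already-seen key that fuzzy-matches value, or None.
  def findBestMatch(value: String, keys: Seq[String]): Option[String] =
    keys.find(k => fuzzyMatch(k, value))

  def main(args: Array[String]): Unit = {
    println(findBestMatch("10.00 inches", Seq("15 in", "10 Inches"))) // Some(10 Inches)
    println(findBestMatch("2 cm", Seq("15 in", "10 Inches")))         // None
  }
}
```

Returning the first hit is enough when fuzzyMatch behaves like an equivalence relation; if it does not, the result depends on the order in which values are seen.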


def fuzzyMatch(s1: String, s2: String): Boolean = {
  // fake implementation
  val matches = List(("15 Inches", "15.00 inches"), ("2 cm", "2.00 CM"))
  s1.equals(s2) || matches.exists({
    case (m1, m2) => (m1.equals(s1) && m2.equals(s2)) || (m1.equals(s2) && m2.equals(s1))
  })
}

def test(): Unit = {
  val input = List(("a", "15 Inches"), ("a", "15.00 inches"), ("a", "10 in"), ("b", "2 cm"), ("b", "2.00 CM"))
  // group the values by key
  val byKey = input.groupBy(_._1).mapValues(l => l.map(_._2))
  val totalOccurrences = byKey.mapValues(_.size)
  val maxByKey = byKey.mapValues(_.head) //random "max" selection logic

  val processedInput: List[(String, String, Double)] = maxByKey.toList.map {
    case (mk, mv) =>
      // count the values under this key that fuzzy-match the selected one
      val matchCount = byKey(mk).count(tv => fuzzyMatch(tv, mv))
      (mk, mv, matchCount.toDouble / totalOccurrences(mk))
  }

  println(processedInput)
}



List((b,2 cm,1.0), (a,15 Inches,0.6666666666666666))

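Here is one approach that pre-processes input with fuzzy matching; the result can then be used as the input to your existing code.

The idea is to first generate the 2-combinations of the input tuples, fuzzy-match them to build a Map of distinct Sets of matched values per key, and finally use that Map to fuzzy-match your original input.

To make sure more arbitrary cases are covered, I have expanded your input: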

val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}

// Create a Map of Sets of fuzzy-matched values from all 2-combinations per key
val fuzMap = input.combinations(2).foldLeft( Map[String, Seq[Set[String]]]() ){
  case (m, Seq(t1: Tuple2[String, String], t2: Tuple2[String, String])) =>
    // only pair up values that belong to the same key
    if (t1._1 == t2._1 && fuzzyMatch(t1._2, t2._2)) {
      // extend any existing set that already contains one of the two values
      val fuzSets = m.getOrElse(t1._1, Seq(Set(t1._2, t2._2))).map(
        x => if (x.contains(t1._2) || x.contains(t2._2)) x ++ Set(t1._2, t2._2) else x
      )
      // neither value is in a set yet: start a new set for this key
      if (!fuzSets.flatten.contains(t1._2) && !fuzSets.flatten.contains(t2._2))
        m + (t1._1 -> (fuzSets :+ Set(t1._2, t2._2)))
      else
        m + (t1._1 -> fuzSets)
    }
    else
      m
}
// fuzMap: scala.collection.immutable.Map[String,Seq[Set[String]]] = Map(
//   a -> List(Set(10 in, 10 inches), Set(15 in, 15 Inches, 15.00 inches)),
//   b -> List(Set(2 cm, 2.00 CM))
// )

Note that for a large input, it may make sense to first groupBy the key and generate the 2-combinations per key.
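
A rough sketch of that group-first variant is shown below. The names PerKeyFuzMap, addPair and fuzMapByKey are made up for illustration, and fuzzyMatch is the same trivialized stand-in. Pairs are generated only within a key, so the number of comparisons grows with the size of each group rather than with the whole input.

```scala
object PerKeyFuzMap {
  // Trivialized fuzzy match, an assumed stand-in for the real one.
  def fuzzyMatch(s1: String, s2: String): Boolean = {
    def norm(s: String): String =
      s.toLowerCase.replace(".00", "").replace("inches", "in")
    norm(s1) == norm(s2)
  }

  // Merge one matched pair into the sets collected so far for a single key.
  def addPair(sets: Seq[Set[String]], a: String, b: String): Seq[Set[String]] = {
    val (hit, rest) = sets.partition(s => s.contains(a) || s.contains(b))
    rest :+ hit.foldLeft(Set(a, b))(_ ++ _)
  }

  // Group by key first, then fuzzy-match the 2-combinations inside each group.
  def fuzMapByKey(input: List[(String, String)]): Map[String, Seq[Set[String]]] =
    input.groupBy(_._1).map { case (key, pairs) =>
      val values = pairs.map(_._2).distinct
      val sets = values.combinations(2).foldLeft(Seq.empty[Set[String]]) {
        case (acc, List(a, b)) if fuzzyMatch(a, b) => addPair(acc, a, b)
        case (acc, _)                              => acc
      }
      key -> sets
    }.filter(_._2.nonEmpty) // keys without any fuzzy match are left out

  val sample = List(
    ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
    ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
    ("c", "7 cm"), ("c", "7 in")
  )

  def main(args: Array[String]): Unit =
    fuzMapByKey(sample).foreach(println)
}
```

For the sample input this should yield the same two sets for a and the single set for b as the fold over all 2-combinations, although the order of the sets within a key may differ.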


// Fuzzy-match original input using fuzMap
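val fuzInput = input.map{ case (k, v) =>
  if (fuzMap.get(k).isDefined) {
    // replace v by the smallest member of the set that contains it, if any
    val fuzValues = fuzMap(k).map{
      case x => if (x.contains(v)) Some(x.min) else None
    }.flatten
    if (!fuzValues.isEmpty)
      (k, fuzValues.head)
    else
      (k, v)
  }
  else
    (k, v)
}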
// fuzInput: List[(String, String)] = List(
//   (a,10 in), (a,15 Inches), (a,10 in), (a,15 Inches), (a,15 Inches),
//   (b,2 cm), (b,4 cm), (b,2 cm),
//   (c,7 cm), (c,7 in)
// )
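
To close the loop, here is a sketch of feeding the normalized list into the counting logic from the question. The object and method names (CountNormalized, summarize) are illustrative only, and the list is copied from the fuzInput result shown above. It uses map instead of mapValues so that it behaves the same on Scala 2.12 and 2.13.

```scala
object CountNormalized {
  // fuzInput as listed in the comment above.
  val fuzInput = List(
    ("a", "10 in"), ("a", "15 Inches"), ("a", "10 in"), ("a", "15 Inches"), ("a", "15 Inches"),
    ("b", "2 cm"), ("b", "4 cm"), ("b", "2 cm"),
    ("c", "7 cm"), ("c", "7 in")
  )

  // For each key: the most frequent value and its share of that key's occurrences.
  def summarize(in: List[(String, String)]): List[(String, String, Double)] = {
    val tupleCounts = in.groupBy(identity).map { case (t, ts) => t -> ts.size }
    val totals      = in.groupBy(_._1).map { case (k, ts) => k -> ts.size }
    val maxPerKey   = tupleCounts.groupBy(_._1._1).map { case (k, m) => k -> m.values.max }
    tupleCounts.toList
      .filter { case ((k, _), n) => n == maxPerKey(k) }
      .map { case ((k, v), n) => (k, v, n.toDouble / totals(k)) }
      .sortBy(t => (t._1, t._2)) // stable, readable order for printing
  }

  def main(args: Array[String]): Unit =
    summarize(fuzInput).foreach(println)
    // (a,15 Inches,0.6)
    // (b,2 cm,0.6666666666666666)
    // (c,7 cm,0.5)
    // (c,7 in,0.5)
}
```

Because the filter keeps every value that reaches the maximum count, a tie such as the two c values produces one row per tied value; picking a single winner would need an extra tie-breaking rule.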