scala使用模糊字符串匹配合并元组

scala merge tuples using fuzzy string match

输入:

val input = List((a, 10 Inches), (a, 10.00 inches), (a, 15 in), (b, 2 cm), (b, 2.00 CM))

我喜欢有输出

val output = List((a, 10 Inches, 0.66), (b, 2 cm, 1))

我还有一个效用函数,returns 模糊匹配(“10 英寸”、“10.00 英寸”)

fuzzyMatch(s1, s2) returns

true for s1 = "10 Inches" and s2 = "10.00 inches"
false for s1 = "10 Inches" and s2 = "15 in"
false for s1 = "10.00 inches" and s2 = "15 in"
true for s1 = "2 cm" and s2 = "2.00 CM"

Output = List of (unique_name, max occurred string value, (max number of occurrences/total occurrences))

我怎样才能将上面的输入减少到输出

到目前为止我有什么

val tupleMap = input.groupBy(identity).mapValues(_.size)
val totalOccurrences = input.groupBy(_._1).mapValues(_.size)
val maxNumberOfValueOccurrences = tupleMap.groupBy(_._1._1).mapValues(_.values.max)
val processedInput = tupleMap
      .filter {
        case (k, v) => v == maxNumberOfValueOccurrences(k._1)
      }
      .map {
        case (k, v) => (k._1, k._2, v.toDouble / totalOccurrences(k._1))
      }.toSeq

这是给出完全匹配的比率。我怎样才能适应我的模糊匹配,以便它可以对所有相似的值进行分组并计算比率?模糊匹配值可以是任意匹配。

它本质上是使用我的 fuzzyMatch(...) 方法的自定义 groupBy。但是我在这里想不出解决办法。

经过深思熟虑,我得到了如下结果。更好的解决方案将不胜感激。

val tupleMap: Map[String, Seq[String]] = input.groupBy(_._1).mapValues(_.map(_._2))

val result = tupleMap mapValues {
list =>
val valueCountsMap: mutable.Map[String, Int] = mutable.Map[String, Int]()

list foreach {
  value =>
    // Using fuzzy match to find the best match
    // findBestMatch (uses fuzzyMatch) returns the Option(key) 
    // if there exists a similar key, if not returns None
    val bestMatch = findBestMatch(value, valueCountsMap.keySet.toSeq) 
    if (bestMatch.isDefined) {
      val newValueCount = valueCountsMap.getOrElse(bestMatch.get, 0) + 1
      valueCountsMap(bestMatch.get) = newValueCount
    } else {
      valueCountsMap(value) = 1
    }
}

val maxOccurredValueNCount: (String, Int) = valueCountsMap.maxBy(_._2)
(maxOccurredValueNCount._1, maxOccurredValueNCount._2)
}

如果出于某种原因转换为数值的方法对您不起作用,这里的代码似乎可以满足您的需求:

def fuzzyMatch(s1: String, s2: String): Boolean = {
  // fake implementation
  val matches = List(("15 Inches", "15.00 inches"), ("2 cm", "2.00 CM"))
  s1.equals(s2) || matches.exists({
    case (m1, m2) => (m1.equals(s1) && m2.equals(s2)) || (m1.equals(s2) && m2.equals(s1))
  })
}

 def test(): Unit = {
  val input = List(("a", "15 Inches"), ("a", "15.00 inches"), ("a", "10 in"), ("b", "2 cm"), ("b", "2.00 CM"))
  val byKey = input.groupBy(_._1).mapValues(l => l.map(_._2))
  val totalOccurrences = byKey.mapValues(_.size)
  val maxByKey = byKey.mapValues(_.head) //random "max" selection logic

  val processedInput: List[(String, String, Double)] = maxByKey.map({
    case (mk, mv) =>
      val matchCount = byKey(mk).count(tv => fuzzyMatch(tv, mv))
      (mk, mv, matchCount / totalOccurrences(mk).asInstanceOf[Double])
  })(breakOut)

  println(processedInput)
}

这会打印

List((b,2 cm,1.0), (a,15 Inches,0.6666666666666666))

这是一种使用模糊匹配预处理 input 的方法,然后将其用作现有代码的输入。

想法是首先生成 input 元组的 2 种组合,对它们进行模糊匹配以创建由每个键的匹配值组成的不同集合的映射,最后使用映射来模糊-匹配您原来的 input.

为了确保涵盖更多任意情况,我扩展了您的 input:

val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}

// Create a Map of Sets of fuzzy-matched values from all 2-combinations per key
val fuzMap = input.combinations(2).foldLeft( Map[String, Seq[Set[String]]]() ){
  case (m, Seq(t1: Tuple2[String, String], t2: Tuple2[String, String])) =>
    if (fuzzyMatch(t1._2, t2._2)) {
      val fuzSets = m.getOrElse(t1._1, Seq(Set(t1._2, t2._2))).map(
        x => if (x.contains(t1._2) || x.contains(t2._2)) x ++ Set(t1._2, t2._2) else x
      )
      if (!fuzSets.flatten.contains(t1._2) && !fuzSets.flatten.contains(t2._2))
        m + (t1._1 -> (fuzSets :+ Set(t1._2, t2._2)))
      else
        m + (t1._1 -> fuzSets)
    }
    else
      m
}
// fuzMap: scala.collection.immutable.Map[String,Seq[Set[String]]] = Map(
//   a -> List(Set(10 in, 10 inches), Set(15 in, 15 Inches, 15.00 inches)), 
//   b -> List(Set(2 cm, 2.00 CM)))
// )

请注意,对于大 input,首先 groupBy 键并为每个键生成 2 个组合可能是有意义的。

下一步是使用创建的地图对原始输入进行模糊匹配:

// Fuzzy-match original input using fuzMap
val fuzInput = input.map{ case (k, v) => 
  if (fuzMap.get(k).isDefined) {
    val fuzValues = fuzMap(k).map{
      case x => if (x.contains(v)) Some(x.min) else None
    }.flatten
    if (!fuzValues.isEmpty)
      (k, fuzValues.head)
    else
      (k, v)
  }
  else
    (k, v)
}
// fuzInput: List[(String, String)] = List(
//   (a,10 in), (a,15 Inches), (a,10 in), (a,15 Inches), (a,15 Inches),
//   (b,2 cm), (b,4 cm), (b,2 cm),
//   (c,7 cm), (c,7 in)
// )