Scala case class object 'key not found' when used as a HashMap key in Spark

Question

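I'm trying to access a HashMap from a function in Spark 2.0, but the lookup fails if I parallelize the List. It works if I don't parallelize, and it also works if I don't use a case class.

Here is some sample code for what I'm trying to do: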

case class TestData(s: String)

def testKey(testData: TestData): Unit = {
  println(s"Current Map: $myMap")
  println(s"Key sent into function: $testData")
  println("Key isn't found in Map:")
  println(myMap(testData)) // fails here with NoSuchElementException
}

val myList = sc.parallelize(List(TestData("foo")))
val myMap = Map(TestData("foo") -> "bar")
myList.collect.foreach(testKey) // collect so the println output is visible

This is the exact output:

Current Map: Map(TestData(foo) -> bar)
Key sent into function: TestData(foo)
Key isn't found in Map:
java.util.NoSuchElementException: key not found: TestData(foo)

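The code above is similar to what I'm actually trying to do, except that the real case class is more complex and the HashMap has lists as values. Also, in the example above I use 'collect' so that the print statements are output. Without collect the example still gives the same error, just without the prints.

The hashCodes already match, but I also tried overriding equals and hashCode for the case class, and the problem is the same.

This is on Databricks, so I don't believe I have access to the REPL or spark-submit.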
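
For the hashCode and equals check mentioned above, a small diagnostic can show which part of the comparison disagrees. This is only a sketch: KeyDiagnostics and compare are hypothetical names introduced for illustration, not part of Spark or the Scala library.

```scala
object KeyDiagnostics {
  // Hypothetical helper: reports how two values compare on the three
  // properties a hash-based Map lookup depends on.
  def compare(a: Any, b: Any): Map[String, Boolean] = Map(
    "equals"       -> (a == b),             // what Map.apply ultimately relies on
    "sameHashCode" -> (a.## == b.##),       // null-safe hash comparison
    "sameClass"    -> (a.getClass == b.getClass) // same runtime class object?
  )
}
```

Inside testKey it could be called as myMap.keys.foreach(k => println(KeyDiagnostics.compare(k, testData))). If "sameHashCode" is true while "equals" is false, the key lands in the right bucket but is rejected by the equality check, which would match the symptom described here.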

Answer 1

Thanks to the comments pointing to the related Spark issue, which led me to this solution for my case:

case class TestData(s: String) {
  // Compare by class and field only, using an explicit isInstanceOf check
  override def equals(obj: Any): Boolean =
    obj.isInstanceOf[TestData] && obj.asInstanceOf[TestData].s == this.s
}

Overriding equals to include isInstanceOf fixes the problem. It may not be the best solution, but it is definitely the easiest workaround.
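
As a rough check that this override still behaves as a well-formed map key, here is a plain-Scala sketch with no Spark involved. The EqualsSketch wrapper and the lookup helper are assumptions added for illustration. It demonstrates only the override itself, not the notebook failure.

```scala
object EqualsSketch {
  final case class TestData(s: String) {
    // isInstanceOf is a plain runtime class check; equality then depends only on s.
    override def equals(obj: Any): Boolean =
      obj.isInstanceOf[TestData] && obj.asInstanceOf[TestData].s == this.s
    // hashCode is left to the compiler-generated version, which is derived
    // from s and therefore stays consistent with the equals above.
  }

  val myMap: Map[TestData, String] = Map(TestData("foo") -> "bar")

  // Hypothetical helper: returns None for a missing key instead of
  // throwing NoSuchElementException the way myMap(key) does.
  def lookup(key: TestData): Option[String] = myMap.get(key)
}
```

Using myMap.get rather than myMap(key) is a design choice for the sketch: a missing key becomes a value that can be inspected, which is easier to debug inside a Spark job than an exception.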

Answer 2

Your logic is circular and wrong. You are passing the same RDD to the Map and calling it with TestData. Update it so the steps run in sequence, as follows:

case class TestData(s: String)

def testKey(testData: TestData): Unit = {
  // The map is built from the same instance that is looked up
  val myMap = Map(testData -> "bar")
  println(s"Current Map: $myMap")
  println(s"Key sent into function: $testData")
  println("Key isn't found in Map:")
  println(myMap(testData)) // prints bar
}

val myList = sc.parallelize(List(TestData("foo")))
myList.collect.foreach(testKey)

Its output is:

Current Map: Map(TestData(foo) -> bar)
Key sent into function: TestData(foo)
Key isn't found in Map:
bar

Hope this is what you are expecting...

scala

hashmap

case-class

apache-spark

databricks