Scala variable resets after for-each loop / string gets truncated

Question

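I am working on a Spark project. In the code below, I have a string that I use to collect my results so that I can write them to a file later (I know this is not the right way to do it; I am only checking what is inside the Tuple3 returned by a method). The string gets truncated after a for-each loop. Here is the relevant part of my code: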

val newLine = sys.props("line.separator") // also tried "\n". I am using OS X.

var str = s"*** ${newLine}"

for (tuple3 <- ArrayOfTuple3s) {
  for (list <- tuple3._3) {
    for (strItem <- list) {
      str += s"${strItem}, "
    }
    str += s"${newLine}"
  }
  str += s"${newLine}"
  println(str)
}

print("str=" + str)

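The first println call prints the correct string value (the concatenated result), but once the loop has finished, the value of str is *** (the same value that was assigned to it before the first loop).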

Edit: I replaced the immutable String object str with a StringBuilder, but the result did not change:

val newLine: String = sys.props("line.separator")

var str1: StringBuilder = new StringBuilder(15000)

for (tuple3 <- ArrayOfTuple3s) {
  for (list <- tuple3._3) {
    for (str <- list) {
      str1.append(s"${str}, ")
    }
    str1.append(s"${newLine}")
  }
  str1.append(s"${newLine}")
  println(str1.toString())
}

print("resulting str1=" + str1.toString())

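Edit 2: I mapped the RDD to get the third field of the Tuple3 directly. The result is itself an RDD of arrays of lists. I changed the code accordingly, but I still get the same result (the resulting string is empty, although inside the for loop it is not).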

val rddOfArraysOfLists = getArrayOfTuple3s(mainRdd).map(_._3)

for (arrayOfLists <- rddOfArraysOfLists) {
  for (list <- arrayOfLists) {
    for (field <- list) {
      str1.append(s"${field}, ")
    }
    str1.append(" -- ")
  }
  str1.append(s"${newLine}")
  println(str1.toString())
}

Edit 4: I think the problem has nothing to do with strings. Variables of every type seem to be affected.

var count = 0

for (arrayOfLists <- myArray) {
  count = arrayOfLists.last(3).toInt
  println(s"count=$count")
}

println(s"count=$count")

The value is non-zero inside the loop but 0 outside it. Any ideas?

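Edit 5: I cannot post the whole code (due to confidentiality restrictions), but here is the main part of it. In case it matters, I am running Spark on my local machine from IntelliJ IDEA (for debugging).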

System.setProperty("spark.cores.max", "8")
System.setProperty("spark.executor.memory", "15g")    
val sc = new SparkContext("local", getClass.getName)            
val samReg = sc.objectFile[Sample](sampleLocation, 200).distinct

val samples = samReg.filter(f => f.uuid == "dce03545e8034242").sortBy(_.time).cache()

val top3Samples = samples.take(3)
for (sample <- top3Samples) {
  print("sample: ")
  println(s"uuid=${sample.uuid}, time=${sample.time}, model=${sample.model}")
}

val firstTimeStamp = samples.first.time
val targetTime = firstTimeStamp + 2592000 // + 1 month in seconds (samples during the first month)

val rddOfArrayOfSamples = getCountsRdd(samples.filter(_.time <= targetTime)).map(_._1).cache()
// Due to confidentiality matters, I cannot reveal the code, 
// but here is a description:
// I have an array of samples. Each sample has a few String fields 
// and is represented by a List[String]
// The above RDD is of the type RDD[Array[List[String]]]. 
// It contains only a single array of samples
// (because I passed a filtered set of samples to the function), 
// but it may contain more.
// The fourth field of each sample (list) is an increasing number (count)

println(s"number of arrays in the RDD: ${rddOfArrayOfSamples.count()}")

var maxCount = 0
for (arrayOfLists <- rddOfArrayOfSamples) {
  println(s"Last item of the array (a list)=${arrayOfLists.last}")
  maxCount = arrayOfLists.last(3).toInt
  println(s"maxCount=${maxCount}")
}
println(s"maxCount=${maxCount}")

Output:

sample: uuid=dce03545e8034242, time=1360037324, model=Nexus 4
sample: uuid=dce03545e8034242, time=1360037424, model=Nexus 4
sample: uuid=dce03545e8034242, time=1360037544, model=Nexus 4
number of arrays in the RDD: 1
Last item of the array (a list)=List(dce03545e8034242, Nexus 4, 1362628767, 32, 2089, 0.97, 0.15999999999999992, 0)
maxCount=32
maxCount=0

Answer 1

Since you did not post a complete example, I had to improvise parts of the code.

For your Edit 4, I used this array:

val myArray = Array(
  List(List(0, 0, 0, 0), List(0, 0, 0, 0), List(0, 0, 0, 0)),
  List(List(1, 1, 1, 1), List(1, 1, 1, 1), List(1, 1, 1, 1)),
  List(List(2, 2, 2, 2), List(2, 2, 2, 2), List(2, 2, 2, 2))
)

Running this in the REPL:

var count = 0

for (arrayOfLists <- myArray) {
  count = arrayOfLists.last(3).toInt
  println(s"count=$count")
}

println(s"count=$count")

I get:

scala> for (arrayOfLists <- myArray) {
     |   count = arrayOfLists.last(3).toInt
     |   println(s"count=$count")
     | }
count=0
count=1
count=2

scala> println(s"count=$count")
count=2

The value is non-zero inside the loop and also non-zero outside it.

If you post a complete example, perhaps we can help you more.

Answer 2

Promoting my explanation from the comments to an answer:

See this answer to a somewhat related question:

Not to get into too many details, but when you run different transformations on a RDD (map, flatMap, filter and others), your transformation code (closure) is:

serialized on the driver node,
shipped to the appropriate nodes in the cluster,
deserialized,
and finally executed on the nodes
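The effect of those four steps can be imitated without Spark. The following is only a rough sketch under stated assumptions: `Task` and `roundTrip` are hypothetical names invented here, and plain Java serialization stands in for whatever Spark actually does when it ships a closure. It shows that a deserialized copy can be mutated freely while the original object on the "driver" side stays untouched.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a closure that captured a mutable variable.
class Task(var maxCount: Int) extends java.io.Serializable {
  // Overwrites maxCount with each value in turn, like the loop body in the question.
  def run(values: Seq[Int]): Unit = values.foreach(v => maxCount = v)
}

object ClosureCopyDemo {
  // Serialize and deserialize an object, imitating the driver-to-executor trip.
  def roundTrip[T <: java.io.Serializable](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverSide = new Task(0)              // lives in the calling program
    val executorSide = roundTrip(driverSide)  // independent copy after the round trip
    executorSide.run(Seq(5, 17, 32))
    println(s"executor copy:   maxCount=${executorSide.maxCount}") // maxCount=32
    println(s"driver original: maxCount=${driverSide.maxCount}")   // maxCount=0
  }
}
```

The two printed values match the pattern in the question's output: 32 where the code ran, 0 in the calling program.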

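The for in your code is just syntactic sugar for a foreach call on the RDD (it would be map only if the loop used yield), so the loop body is exactly such a closure.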
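To spell out the desugaring: the Scala compiler rewrites a for loop without yield into a foreach call, and a for loop with yield into a map call. The sketch below uses a plain local Array with made-up data, and the method names are illustrative only. On a local collection the function runs in the same JVM against the same variable, which is why the REPL experiment in Answer 1 does see the update after the loop.

```scala
object ForDesugarDemo {
  // for without yield: the compiler rewrites this into xs.foreach(...)
  def lastCountWithFor(xs: Array[List[Int]]): Int = {
    var count = 0
    for (x <- xs) {
      count = x(3)
    }
    count
  }

  // The same loop written as the explicit foreach call
  def lastCountWithForeach(xs: Array[List[Int]]): Int = {
    var count = 0
    xs.foreach { x => count = x(3) }
    count
  }

  // for with yield: the compiler rewrites this into xs.map(...)
  def allCounts(xs: Array[List[Int]]): Array[Int] =
    for (x <- xs) yield x(3)

  def main(args: Array[String]): Unit = {
    val xs = Array(List(0, 0, 0, 10), List(1, 1, 1, 20))
    println(lastCountWithFor(xs))         // 20
    println(lastCountWithForeach(xs))     // 20
    println(allCounts(xs).mkString(", ")) // 10, 20
  }
}
```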

So the maxCount that gets updated on each execution is not the same maxCount as the one in your calling program. That one never changes.

The lesson here is: do not use closures (blocks) that update variables defined outside the block.
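One way to follow that advice is to let an RDD action return the value to the driver instead of assigning to an outer variable. The following is a minimal sketch, not the asker's actual code: the sample data is invented, `lastCount` is a hypothetical helper, and it assumes a Spark version that provides `SparkConf` and runs with a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MaxCountOnDriver {
  // Hypothetical helper: the fourth field of the last sample in an array.
  def lastCount(samples: Array[List[String]]): Int = samples.last(3).toInt

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MaxCountOnDriver")
    val sc = new SparkContext(conf)
    try {
      // Invented stand-in for rddOfArrayOfSamples (RDD[Array[List[String]]]).
      val rdd = sc.parallelize(Seq(
        Array(
          List("dce03545e8034242", "Nexus 4", "1360037324", "7"),
          List("dce03545e8034242", "Nexus 4", "1362628767", "32")
        )
      ))

      // Option 1: transform on the executors, then let an action return the result.
      val maxCount = rdd.map(lastCount).max()
      println(s"maxCount=$maxCount")

      // Option 2: collect the (small) per-array results and loop on the driver,
      // where assigning to a local var behaves as expected.
      var lastSeen = 0
      for (count <- rdd.map(lastCount).collect()) {
        lastSeen = count
      }
      println(s"lastSeen=$lastSeen")
    } finally {
      sc.stop()
    }
  }
}
```

Option 2 only makes sense when the collected data fits comfortably in driver memory. For the string-building case in the question, the same pattern applies: build one line per array inside map, collect the lines, and join them on the driver.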

string

scala

apache-spark