Scala filter and collect are slow

Question

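I have just started with Scala development and am trying to use filter and collect to drop unneeded lines from an iterator. But it seems to run far too slowly.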

import scala.io.Source

val src = Source.fromFile("/home/Documents/1987.csv") // 1.2 million lines
val iter = src.getLines().map(_.split(":"))
val iter250 = iter.take(250000) // Only interested in the first 250,000

val intrestedIndices = Range(1, 100000, 3).toSeq // This could be in any order

// Note: slicedData is an Iterator, so it has to be rebuilt before each case below
val slicedData = iter250.zipWithIndex

// Takes 3 minutes
val firstCase = slicedData.collect { case (x, i) if intrestedIndices.contains(i) => x }.size

// Takes 3 minutes
val secondCase = slicedData.filter(x => intrestedIndices.contains(x._2)).size

// Takes 1 second
val thirdCase = slicedData.collect { case (x, i) if i % 3 == 0 => x }.size

In the first and second cases, the intrestedIndices.contains(_) part seems to be what slows the program down. Is there another way to speed this up?

Answer 1

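For the record, here is a way to filter with an (ordered) sequence of indices, not necessarily equidistant, without scanning the indices at every step: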

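def filterInteresting[T](it: Iterator[T], indices: Seq[Int]): Iterator[T] =
  // Convert once to a List so the `::` pattern below matches whatever Seq is passed in
  it.zipWithIndex.scanLeft((indices.toList, Option.empty[T])) {
    case ((remaining, _), (elem, index)) => remaining match {
      case h :: t if h == index => (t, Some(elem)) // wanted position: emit elem, move on
      case l => (l, None)                          // not wanted: keep waiting for the head of l
    }
  }.flatMap(_._2) // drop the None placeholders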
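
The same idea can also be sketched without scanLeft, by keeping a cursor into the index sequence and advancing it as the data iterator moves forward. This is only an illustration: the names SortedIndexFilter and filterByIndex are made up here, and the sketch assumes the indices are given in ascending order (entries that lie behind the current position are skipped and never matched).

```scala
object SortedIndexFilter {
  // Keeps the elements of `it` whose position appears in `sortedIndices`.
  // Assumption: `sortedIndices` is ascending. Duplicates yield the element once.
  def filterByIndex[T](it: Iterator[T], sortedIndices: Seq[Int]): Iterator[T] = {
    val wanted = sortedIndices.iterator.buffered
    it.zipWithIndex
      .takeWhile(_ => wanted.hasNext) // stop reading once every index has been handled
      .filter { case (_, i) =>
        // Skip indices that are already behind the current position
        while (wanted.hasNext && wanted.head < i) wanted.next()
        if (wanted.hasNext && wanted.head == i) { wanted.next(); true } else false
      }
      .map(_._1)
  }
}
```

Each element is compared against at most the current head of the index sequence, so the total work is proportional to the number of rows read plus the number of indices, rather than their product.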

Answer 2

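This comment helped solve the problem:

In the first two cases you scan all of interestedIndices in linear time for every element. Use a Set instead of a Seq to improve performance. – Sergey Lagutin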
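
A minimal sketch of what that suggestion could look like. The object and method names (SetLookupSketch, countSelected) are invented for illustration, and generated strings stand in for the CSV rows from the question. The only substantive change is that the indices are converted to a Set once, before the traversal, so each membership check is a hash lookup instead of a scan over the whole sequence.

```scala
object SetLookupSketch {
  // Counts the rows whose position is contained in `indices`.
  def countSelected[T](rows: Iterator[T], indices: Seq[Int]): Int = {
    val wanted: Set[Int] = indices.toSet // built once, order of `indices` does not matter
    rows.zipWithIndex.count { case (_, i) => wanted.contains(i) }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in data instead of reading /home/Documents/1987.csv
    val rows = Iterator.tabulate(250000)(i => s"row-$i")
    val intrestedIndices = Range(1, 100000, 3)
    println(countSelected(rows, intrestedIndices))
  }
}
```

Unlike the ordered approach in Answer 1, this works for indices in any order, at the cost of holding the whole Set in memory.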

Tags: optimization, performance, scala, filter, collect