How to match a particular string/regex split over different lines as separate strings?

Question

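Here is a demonstration of what my code does. I am reading a file lazily with Source.fromFile(filePath), using the .getLines() method to read each line as a string, and iterating over the lines to check whether a certain word/pattern occurs in them.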

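Let's take the pattern to be matched as ".read.". If the whole pattern appears on a single line, line.contains(".read.") works fine. The problem arises when it is spread over consecutive lines in any of the following ways: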

.
read.

or

.
read
.

or

.read
.

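I can't even collect the entire contents of the file into a List[String], because the memory consumption is too high. That means I can't use indices to concatenate the previous or next line, since all I have is the bufferedSource iterator.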

import scala.io.Source

val bufferedSource = Source.fromFile("C:/code.scala")
val key = ".read."
var lineCounter = 0
for (bufferedline <- bufferedSource.getLines()) {
    lineCounter += 1
    val line = bufferedline.trim
    // only catches occurrences that sit entirely on one line
    if (line.nonEmpty && line.contains(key))
        println("Found pattern at : " + lineCounter)
}
bufferedSource.close()

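I'm not sure how to change this to handle the case where the pattern is spread over several strings rather than sitting in one string, with the parts separated by newlines. Any help on how to approach this kind of problem would be appreciated.

Note: a pattern spread over 3 lines is just a simple example. There could also be a case where a particular string such as "spark.read.option" has to be found and it is spread over 5 different lines.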

Answer 1

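If I were attempting this I would:

Drop the use of getLines(). It complicates finding a target in multi-line text.
Drop the use of a regex pattern as the target string. Finding a match that may or may not contain multiple \n characters is demanding.

So I'd use a character-by-character search for a fixed target.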

def findInFile(charItr :Iterator[Char], target :String) :Unit = {
  assert(target.nonEmpty)
  // Try to consume the rest of the target (str), skipping newlines.
  // On success, return the advanced iterator and the current line number.
  def consumeStr(subci   :Iterator[Char]
                ,str     :String
                ,lineNum :Int
                ) :Option[(Iterator[Char],Int)] =
    if      (str.isEmpty)    Some((subci,lineNum))
    else if (!subci.hasNext) None
    else subci.next() match {
      case '\n'               => consumeStr(subci, str, lineNum + 1)
      case c if c == str.head => consumeStr(subci, str.tail, lineNum)
      case _                  => None
    }

  // Scan for the first character of the target, tracking the line number.
  def loop(ci :Iterator[Char], line :Int) :Unit = if (ci.hasNext) {
    ci.next() match {
      case '\n' => loop(ci, line+1)
      case c if c == target.head =>
        // duplicate so a failed attempt can resume right after this character
        val (oldci,newci) = ci.duplicate
        consumeStr(newci, target.tail, line).fold(loop(oldci, line)){
          case (itr,ln) => println(s"target found: line $line")
                           loop(itr,ln)
        }
      case _ => loop(ci, line)
    }
  }

  loop(charItr, 1)
}

Here's the test file I used...

xxx
x
aa
aaaa
a.b
b.c
cccc
a
aa.bb.caaa.bb.cc.dd
xxx

...and the test target I searched for.

val src = io.Source.fromFile("so.txt")
findInFile(src, "aaa.bb.cc")
src.close()
//target found: line 4
//target found: line 9
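
For comparison, the same fixed-target search can be sketched without recursion or `duplicate`. This is not the code from the answer: `findStreaming` is a hypothetical name, and the sketch assumes it is enough to keep the last `target.length` non-newline characters in a queue, so memory stays bounded by the target length. Unlike `findInFile()`, it also skips `\r` and reports overlapping occurrences.

```scala
import scala.collection.mutable

// Hypothetical helper (not from the original answer): returns the line on which
// each occurrence of `target` starts, ignoring '\n' and '\r' between characters.
def findStreaming(chars: Iterator[Char], target: String): List[Int] = {
  require(target.nonEmpty, "target must not be empty")
  // sliding window of (character, line it was read on)
  val window = mutable.Queue.empty[(Char, Int)]
  val hits   = List.newBuilder[Int]
  var line   = 1
  for (c <- chars) c match {
    case '\n' => line += 1
    case '\r' => ()
    case _ =>
      window += ((c, line))
      if (window.length > target.length) window.dequeue()
      // compare the window's characters with the target once it is full
      if (window.length == target.length &&
          window.iterator.map(_._1).sameElements(target.iterator)) {
        hits += window.head._2 // line of the first matched character
      }
  }
  hits.result()
}

// One of the layouts from the question, held in memory instead of a file:
val sample = "x\n.\nread\n.\ny"
println(findStreaming(sample.iterator, ".read.")) // List(2)
```

Each character costs one comparison of at most `target.length` characters, which is fine for short targets such as "spark.read.option"; a longer target might justify a proper string-matching algorithm instead.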

OK, so I've tweaked findInFile() a bit. Instead of printing, it now returns a list of (line number, matched string) pairs.

def findInFile(charItr :Iterator[Char], target :String) :List[(Int,String)] = {
  assert(target.nonEmpty)
  def consumeStr(subci   :Iterator[Char]
                ,str     :String
                ,lineNum :Int
                ) :Option[(Iterator[Char],Int)] =
    if      (str.isEmpty)    Some((subci,lineNum))
    else if (!subci.hasNext) None
    else subci.next() match {
      case '\n'               => consumeStr(subci, str, lineNum + 1)
      case c if c == str.head => consumeStr(subci, str.tail, lineNum)
      case _                  => None
    }

  def loop(ci :Iterator[Char], line :Int) :List[(Int,String)] =
    if (ci.hasNext) {
      ci.next() match {
        case '\n' => loop(ci, line+1)
        case c if c == target.head =>
          val (oldci,newci) = ci.duplicate
          consumeStr(newci, target.tail, line).fold(loop(oldci, line)){
            // on a match, record (start line, target) and keep scanning after it
            case (itr,ln) => (line,target) :: loop(itr,ln)
          }
        case _ => loop(ci, line)
      }
    } else Nil

  loop(charItr, 1)
}

With this, and using the same test file as before, we can do the following:

val src1 = io.Source.fromFile("so.txt")  //open twice
val src2 = io.Source.fromFile("so.txt")

"a{2,3}.bb.c[ac]".r                                   //regex pattern
                 .findAllIn(src1.getLines().mkString) //all matches
                 .toSeq.distinct                      //remove duplicates
                 .foldLeft(src2.duplicate -> List.empty[(Int,String)]){
                   case (((srcA,srcB),lst),str) =>
                     (srcA.duplicate, lst ++ findInFile(srcB,str))
                 }._2.sorted
//res0: List[(Int, String)] = List((4,aa.bb.cc), (4,aaa.bb.cc), (8,aaa.bb.ca), (9,aa.bb.cc), (9,aaa.bb.cc))

src1.close()  //close up and go home
src2.close()

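The idea is to first read the whole file into memory as a String with no newlines, then find all the regex matches and reduce them to a list of unique matched strings. Each of these is then sent to findInFile(). Sort and return.

It's not very efficient, but it gets the job done.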
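
One way to avoid opening the file twice can be sketched as follows, under the same assumption that the newline-free content fits in memory: record the offset at which each line starts while joining the lines, then map the start offset of each regex match back to a line number. `findRegexAcrossLines` and `lineOf` are hypothetical names, not part of the answer above. Note that `findAllMatchIn` returns non-overlapping matches only, so this can report fewer entries than the `findInFile()` approach.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.matching.Regex

// Hypothetical helper: returns (start line, matched text) for every regex match
// found in the content with line breaks removed.
def findRegexAcrossLines(lines: Iterator[String], pattern: Regex): List[(Int, String)] = {
  val joined     = new java.lang.StringBuilder
  // lineStarts(i) = offset in `joined` at which line i + 1 begins
  val lineStarts = ArrayBuffer.empty[Int]
  for (l <- lines) {
    lineStarts += joined.length
    joined.append(l)
  }
  // Binary search for the last line whose start offset is <= offset.
  def lineOf(offset: Int): Int = {
    var lo = 0
    var hi = lineStarts.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) / 2
      if (lineStarts(mid) <= offset) lo = mid else hi = mid - 1
    }
    lo + 1 // line numbers are 1-based
  }
  pattern.findAllMatchIn(joined.toString)
    .map(m => (lineOf(m.start), m.matched))
    .toList
}

// Usage with in-memory lines; with a file, pass src.getLines() instead.
val found = findRegexAcrossLines(Iterator("spark", ".read", ".option"),
                                 "spark\\.read\\.option".r)
println(found) // List((1,spark.read.option))
```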

regex

scala

multiline