Creating Sequence of Sequences is Causing a StackOverflowException

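I am trying to split a large file into many smaller files. The location of each split is determined by a predicate (the isNextObject function) that examines the contents of each line.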

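I read the large file with File.ReadLines so that I can iterate over it one line at a time without holding the whole file in memory. My approach is to group the sequence into a sequence of smaller subsequences, one per file to be written out.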

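I found a useful function called groupWhen, written by Tomas Petricek and posted on fssnip. It worked well in my initial tests on a small portion of the file, but it throws a StackOverflowException on the real file. I am not sure how to adjust groupWhen to prevent this (I am still an F# newbie).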

Here is a simplified version of the code, showing only the relevant parts needed to reproduce the StackOverflowException:

// This is the function created by Tomas Petricek where the StackOverflowException is occurring
module Seq =
  /// Iterates over elements of the input sequence and groups adjacent elements.
  /// A new group is started when the specified predicate holds about the element
  /// of the sequence (and at the beginning of the iteration).
  ///
  /// For example: 
  ///    Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
  let groupWhen f (input:seq<_>) = seq {
    use en = input.GetEnumerator()
    let running = ref true

    // Generate a group starting with the current element. Stops generating
    // when it finds an element such that 'f en.Current' is 'true'
    let rec group() = 
      [ yield en.Current
        if en.MoveNext() then
          if not (f en.Current) then yield! group() // *** Exception occurs here ***
        else running := false ]

    if en.MoveNext() then
      // While there are still elements, start a new group
      while running.Value do
        yield group() |> Seq.ofList } 

Here is the gist of the code that uses Tomas's function:

module Extractor =

    open System
    open System.IO
    open Microsoft.FSharp.Reflection

    // ... elided a few functions, including "isNextObject", which is
    //     a string -> bool (examines the line and returns true
    //     if the string meets the criteria indicating that we are at the
    //     start of the next inner file)

    let writeFile outputDir file =
        // ... write out "file" to the file system
        // NOTE: file is a seq<string>

    let writeFiles outputDir (files : seq<seq<_>>) =
        files
        |> Seq.iter (fun file -> writeFile outputDir file)

Here is the relevant code from the console application that uses these functions:

let lines = inputFile |> File.ReadLines

writeFiles outputDir (lines |> Seq.groupWhen isNextObject)

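Any ideas on the right way to stop groupWhen from blowing the stack? I am not sure how to convert the function to use an accumulator (or to use a continuation instead, which I think is the correct term).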

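The problem is that the group() function returns a list, which is an eagerly evaluated data structure. Every call to group() therefore has to run to completion, collect all of its results in a list, and return that list. The recursive call thus happens within the same evaluation, that is, truly recursively, with one nested call per element of the group, and that creates stack pressure.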

To mitigate this, you can replace the list with a lazy sequence:

let rec group() = seq {
   yield en.Current
   if en.MoveNext() then
     if not (f en.Current) then yield! group()
   else running := false }

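I would consider less drastic approaches, though. This example is a good illustration of why you should avoid writing the recursion yourself and resort to ready-made folds instead.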

For example, from your description, it seems that Seq.windowed may work for you.
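
If the grouping function is kept, the recursion in group() can also be replaced with a plain loop that fills a buffer. The following is only a sketch, not Tomas's original: the names SeqExt and groupWhenIter are made up for illustration, and it assumes that any single group fits comfortably in memory (the file as a whole is still read lazily).

```fsharp
module SeqExt =
  /// Same grouping rule as groupWhen, but each group is collected with a
  /// while loop instead of recursion, so stack depth does not depend on
  /// the number of elements in a group.
  let groupWhenIter (f: 'T -> bool) (input: seq<'T>) : seq<'T list> = seq {
    use en = input.GetEnumerator()
    // True while en.Current refers to a valid, not yet consumed element.
    let running = ref (en.MoveNext())

    // Collects one group: the current element plus every following
    // element for which the predicate is false.
    let readGroup () =
      let buffer = ResizeArray<'T>()
      buffer.Add(en.Current)
      running.Value <- en.MoveNext()
      while running.Value && not (f en.Current) do
        buffer.Add(en.Current)
        running.Value <- en.MoveNext()
      List.ofSeq buffer

    while running.Value do
      yield readGroup () }
```

With the example from the doc comment, SeqExt.groupWhenIter (fun x -> x % 2 = 1) [3;3;2;4;1;2] yields [[3]; [3; 2; 4]; [1; 2]]. Unlike the original, this sketch yields lists rather than sequences; a list can still be passed wherever a seq<_> is expected.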

It is easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, and they are slow.

So (not actually answering your question), personally I would just fold over the sequence of lines using something like this:

let isNextObject line = 
    line = "---"

type State = {
    fileIndex : int
    filename: string
    writer: System.IO.TextWriter
    }

let makeFilename index  = 
    sprintf "File%i" index

let closeFile (state:State) =
    //state.writer.Close() // would use this in real code
    state.writer.WriteLine("=== Closing {0} ===",state.filename)

let createFile index =
    let newFilename = makeFilename index 
    let newWriter = System.Console.Out // dummy
    newWriter.WriteLine("=== Creating {0} ===",newFilename)
    // create new state with new writer 
    {fileIndex=index + 1; writer = newWriter; filename=newFilename }

let writeLine (state:State) line = 
    if isNextObject line then
        // finish old file here
        closeFile state
        // create new file here and return updated state
        createFile state.fileIndex
    else
        //write the line to the current file
        state.writer.WriteLine(line)
        // return the unchanged state
        state

let processLines (lines: string seq) =
    //setup
    let initialState = createFile 1
    // process the file
    let finalState = lines |> Seq.fold writeLine initialState
    // tidy up
    closeFile finalState

(Obviously a real version would use files rather than the console.)

Yes, it is crude, but it is easy to reason about, with no unpleasant surprises.

Here is a test:

processLines [
    "a"; "b"
    "---";"c"; "d"
    "---";"e"; "f"
]

And the output:

=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
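
For reference, the same fold could be pointed at real files roughly as follows. This is a minimal sketch under several assumptions: FileState, openFile and splitToFiles are hypothetical names, the "File%i" naming and the "---" marker are carried over from the dummy version, the output directory is assumed to exist, and there is no error handling (a writer would be left open if a write throws).

```fsharp
open System.IO

// Placeholder predicate, same as in the console version.
let isNextObject (line: string) = line = "---"

// nextIndex is the index the next file will get; writer is the current file.
type FileState = { nextIndex: int; writer: StreamWriter }

// Creates "File<index>" inside outputDir and returns the state for it.
let openFile (outputDir: string) (index: int) =
    let path = Path.Combine(outputDir, sprintf "File%i" index)
    { nextIndex = index + 1; writer = File.CreateText(path) }

// Folds over the lines, switching to a new file at each marker line.
// Returns the number of files written.
let splitToFiles (outputDir: string) (lines: seq<string>) =
    let step (state: FileState) (line: string) =
        if isNextObject line then
            // Flush and close the finished file, then start the next one.
            state.writer.Dispose()
            openFile outputDir state.nextIndex
        else
            state.writer.WriteLine(line)
            state
    let finalState = lines |> Seq.fold step (openFile outputDir 1)
    finalState.writer.Dispose()
    finalState.nextIndex - 1
```

It could then be called as splitToFiles outputDir (File.ReadLines inputFile). As in the console version, the marker lines themselves are not written to any file, and only one line and one open writer are held at a time.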