Creating Sequence of Sequences is Causing a StackOverflowException

Question

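I am trying to split one large file into many smaller files. Where each split occurs is determined by a predicate (the isNextObject function) that examines the contents of each line.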

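I read the large file with File.ReadLines so that I can iterate over it one line at a time without holding the whole file in memory. My approach is to group the sequence of lines into a sequence of smaller subsequences, one per file to be written out.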

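I found a useful function called groupWhen that Tomas Petricek posted on fssnip. It worked great in my initial tests on a small portion of the file, but it throws a StackOverflowException on the real file. I am not sure how to adapt groupWhen to prevent this (I am still an F# newbie).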

Here is a simplified version of the code, showing only the parts needed to reproduce the StackOverflowException:

// This is the function created by Tomas Petricek where the StackOverflowException is occurring
module Seq =
  /// Iterates over elements of the input sequence and groups adjacent elements.
  /// A new group is started when the specified predicate holds about the element
  /// of the sequence (and at the beginning of the iteration).
  ///
  /// For example: 
  ///    Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
  let groupWhen f (input:seq<_>) = seq {
    use en = input.GetEnumerator()
    let running = ref true

    // Generate a group starting with the current element. Stops generating
    // when it finds an element for which 'f en.Current' is 'true'
    let rec group() = 
      [ yield en.Current
        if en.MoveNext() then
          if not (f en.Current) then yield! group() // *** Exception occurs here ***
        else running := false ]

    if en.MoveNext() then
      // While there are still elements, start a new group
      while running.Value do
        yield group() |> Seq.ofList }

Here is the gist of the code that uses Tomas's function:

module Extractor =

    open System
    open System.IO
    open Microsoft.FSharp.Reflection

    // ... elided a few functions, including "isNextObject", which is
    //     a string -> bool (examines the line and returns true
    //     if the string meets the criteria indicating that we are at the
    //     start of the next inner file)

    let writeFile outputDir file =
        // ... write out "file" to the file system
        // NOTE: file is a seq<string>

    let writeFiles outputDir (files : seq<seq<_>>) =
        files
        |> Seq.iter (fun file -> writeFile outputDir file)

Here is the relevant code from the console application that uses these functions:

let lines = inputFile |> File.ReadLines

writeFiles outputDir (lines |> Seq.groupWhen isNextObject)

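Any ideas on the right way to stop groupWhen from blowing the stack? I am not sure how to convert the function to use an accumulator (or a continuation instead, which I think is the correct term).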

Answer 1

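The problem is that group() returns a list, which is an eagerly evaluated data structure. Every call to group() has to run to completion, collect all of its results in a list, and return that list. The recursive call therefore happens within the same evaluation, i.e. it is truly recursive, so every line in a group adds another stack frame and a long enough group overflows the stack.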

To mitigate this, you can replace the list with a lazy sequence:

let rec group() = seq {
   yield en.Current
   if en.MoveNext() then
     if not (f en.Current) then yield! group()
   else running := false }
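
A fuller sketch of how that change might sit inside groupWhen is shown below. The name groupWhenLazy is made up for this illustration. The code assumes each inner sequence is consumed exactly once, in order, before the next group is requested, which is what writing each file in turn does. The inner sequences all share one enumerator, so skipping a group or enumerating it twice would give wrong results.

```fsharp
/// Hypothetical variant of groupWhen in which each group is a lazy sequence.
/// Assumption: the caller fully enumerates every group, once, before
/// moving on to the next one.
let groupWhenLazy f (input: seq<_>) = seq {
  use en = input.GetEnumerator()
  let running = ref true

  // Lazily yields the current element and the following elements
  // until one satisfies 'f' or the input ends.
  let rec group() = seq {
    yield en.Current
    if en.MoveNext() then
      if not (f en.Current) then yield! group()
    else running.Value <- false }

  if en.MoveNext() then
    // While there are still elements, start a new group
    while running.Value do
      yield group() }
```

Consuming the result with something like `Seq.iter (fun file -> writeFile outputDir file)` satisfies the assumption, as long as writeFile enumerates its argument completely.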

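That said, I would consider less drastic approaches. This example is a good illustration of why you should avoid writing the recursion yourself and reach for a ready-made fold instead.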

For example, judging from your description, Seq.windowed may work for you.
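
If each group fits in memory, the recursion can also be removed by hand with a mutable buffer, which is roughly the accumulator the question asks about. This is a minimal sketch and not Tomas's original code; groupWhenIter is a name invented here.

```fsharp
/// Hypothetical iterative variant of groupWhen. The current group is
/// collected in a ResizeArray, so the stack depth does not depend on
/// how many lines a group contains.
let groupWhenIter (f: 'T -> bool) (input: seq<'T>) : seq<seq<'T>> = seq {
  let buffer = ResizeArray<'T>()
  for x in input do
    // A matching element starts a new group, so emit the finished one first.
    if f x && buffer.Count > 0 then
      yield Seq.ofArray (buffer.ToArray())
      buffer.Clear()
    buffer.Add x
  // Emit the last group, if any elements were seen at all.
  if buffer.Count > 0 then
    yield Seq.ofArray (buffer.ToArray()) }
```

Each group is copied into an array before the buffer is cleared, so the groups stay valid even if they are enumerated later or more than once. The cost is that one whole group is held in memory at a time.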

Answer 2

It is easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, and they are slow.

So (without actually answering your question), personally I would just fold over the sequence of lines, like this:

let isNextObject line = 
    line = "---"

type State = {
    fileIndex : int
    filename: string
    writer: System.IO.TextWriter
    }

let makeFilename index  = 
    sprintf "File%i" index

let closeFile (state:State) =
    //state.writer.Close() // would use this in real code
    state.writer.WriteLine("=== Closing {0} ===",state.filename)

let createFile index =
    let newFilename = makeFilename index 
    let newWriter = System.Console.Out // dummy
    newWriter.WriteLine("=== Creating {0} ===",newFilename)
    // create new state with new writer 
    {fileIndex=index + 1; writer = newWriter; filename=newFilename }

let writeLine (state:State) line = 
    if isNextObject line then
        // finish old file here
        closeFile state
        // create new file here and return updated state
        createFile state.fileIndex
    else
        //write the line to the current file
        state.writer.WriteLine(line)
        // return the unchanged state
        state

let processLines (lines: string seq) =
    //setup
    let initialState = createFile 1
    // process the file
    let finalState = lines |> Seq.fold writeLine initialState
    // tidy up
    closeFile finalState

(Obviously the real version would use files rather than the console.)

Yes, it is crude, but it is easy to reason about, with no unpleasant surprises.

Here is a test:

processLines [
    "a"; "b"
    "---";"c"; "d"
    "---";"e"; "f"
]

The output is as follows:

=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
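
A file-backed version of the same fold might look roughly like the sketch below. The names FileState, openNext and splitFile, and the "File%i.txt" naming scheme, are assumptions made for this illustration. As in the console version, the separator line itself is dropped rather than written to the new file.

```fsharp
open System.IO

// Hypothetical state: index of the next file plus the writer currently open.
type FileState = { NextIndex: int; Writer: StreamWriter }

// Opens File<index>.txt in outputDir and returns the state pointing at it.
let openNext (outputDir: string) (index: int) : FileState =
    let path = Path.Combine(outputDir, sprintf "File%i.txt" index)
    { NextIndex = index + 1; Writer = new StreamWriter(path) }

// Folds over the lines, switching to a new file whenever isNextObject holds.
let splitFile (isNextObject: string -> bool) (outputDir: string) (lines: seq<string>) =
    let step (state: FileState) (line: string) =
        if isNextObject line then
            state.Writer.Dispose()               // finish the old file
            openNext outputDir state.NextIndex   // start the next one
        else
            state.Writer.WriteLine line
            state
    let finalState = lines |> Seq.fold step (openNext outputDir 1)
    finalState.Writer.Dispose()
```

It could be called as `splitFile isNextObject outputDir (File.ReadLines inputFile)`. The sketch does not close the current writer if an exception is thrown partway through; real code would probably want a try/finally around the fold.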
