Creating Sequence of Sequences is Causing a StackOverflowException
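I am trying to split a large file into many smaller files. Where each split occurs is decided by a predicate that examines the contents of each line (the isNextObject function).
I read the large file with File.ReadLines so that I can iterate through it one line at a time without holding the entire file in memory. My approach was to group the sequence into a sequence of smaller sub-sequences (one per file to be written out).
I found a useful function called groupWhen that Tomas Petricek posted on fssnip. It worked great in my initial testing on a small subset of the file, but it throws a StackOverflowException on the real file. I am not sure how to adjust groupWhen to prevent this (I am still an F# beginner).
Here is a simplified version of the code, showing only the relevant parts that reproduce the StackOverflowException: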
// This is the function created by Tomas Petricek where the StackOverflowException is occurring
module Seq =
  /// Iterates over elements of the input sequence and groups adjacent elements.
  /// A new group is started when the specified predicate holds about the element
  /// of the sequence (and at the beginning of the iteration).
  ///
  /// For example:
  ///    Seq.groupWhen isOdd [3;3;2;4;1;2] = seq [[3]; [3; 2; 4]; [1; 2]]
  let groupWhen f (input:seq<_>) = seq {
    use en = input.GetEnumerator()
    let running = ref true

    // Generate a group starting with the current element. Stops generating
    // when it finds an element such that 'f en.Current' is 'true'
    let rec group() =
      [ yield en.Current
        if en.MoveNext() then
          if not (f en.Current) then yield! group() // *** Exception occurs here ***
        else running := false ]

    if en.MoveNext() then
      // While there are still elements, start a new group
      while running.Value do
        yield group() |> Seq.ofList }
Here is the gist of the code that uses Tomas' function:
module Extractor =

  open System
  open System.IO
  open Microsoft.FSharp.Reflection

  // ... elided a few functions, including "isNextObject", which is
  //     a string -> bool (examines the line and returns true
  //     if the string meets the criteria indicating that we are at the
  //     start of the next inner file)

  let writeFile outputDir file =
    // ... write out "file" to the file system
    // NOTE: file is a seq<string>

  let writeFiles outputDir (files : seq<seq<_>>) =
    files
    |> Seq.iter (fun file -> writeFile outputDir file)
Here is the relevant code from the console app that uses these functions:
let lines = inputFile |> File.ReadLines
writeFiles outputDir (lines |> Seq.groupWhen isNextObject)
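Any ideas on the proper way to stop groupWhen from blowing the stack? I am not sure how to convert the function to use an accumulator (or a continuation instead, which I think is the correct term).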
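The problem is that the group() function returns a list, which is an eagerly evaluated data structure. Every call to group() has to run to the end, collect all of its results in a list, and return that list. The recursive call therefore happens inside that same evaluation, i.e. it is truly recursive, and each line in a group adds another stack frame. A group with enough lines exhausts the stack.
To mitigate this, you could replace the list with a lazy sequence: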
let rec group() = seq {
   yield en.Current
   if en.MoveNext() then
     if not (f en.Current) then yield! group()
   else running := false }
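If you would rather drop the recursion entirely, the same grouping can be written as a plain loop. The following is only a sketch, not Tomas' function: the name groupWhenIter and the ResizeArray buffer are assumptions made for illustration. Like the original, it holds one group in memory at a time, and a line for which the predicate is true becomes the first element of the next group.

```fsharp
/// Hypothetical non-recursive variant of groupWhen (illustration only).
/// Buffers the current group and emits it when the predicate signals a new one.
let groupWhenIter (f: 'T -> bool) (input: seq<'T>) : seq<seq<'T>> = seq {
    use en = input.GetEnumerator()
    if en.MoveNext() then
        // The first element always opens a group, as in the original.
        let buffer = ResizeArray<'T>()
        buffer.Add en.Current
        while en.MoveNext() do
            if f en.Current then
                // Predicate holds: emit the finished group, then start a new one.
                let finished = buffer.ToArray()
                buffer.Clear()
                yield (finished :> seq<'T>)
            buffer.Add en.Current
        // Emit whatever is left as the last group.
        yield (buffer.ToArray() :> seq<'T>) }

// groupWhenIter (fun x -> x % 2 = 1) [3;3;2;4;1;2]
// yields the groups [|3|]; [|3; 2; 4|]; [|1; 2|]
```

Because each group is copied into an array before it is yielded, the consumer can enumerate groups in any order or not at all, which the lazy recursive version above does not allow: there, every inner sequence shares the one enumerator and has to be read to the end before the next group is requested.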
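That said, I would look into less drastic approaches. This example is a good illustration of why you should avoid writing the recursion yourself and reach for ready-made folds instead.
For example, judging by your description, Seq.windowed may work for you.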
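It is easy to overuse sequences in F#, IMO. You can accidentally get stack overflows, and they are slow.
So (not actually answering your question), personally I would just fold over the sequence of lines, something like this: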
let isNextObject line =
    line = "---"

type State = {
    fileIndex : int
    filename: string
    writer: System.IO.TextWriter
    }

let makeFilename index =
    sprintf "File%i" index

let closeFile (state:State) =
    //state.writer.Close() // would use this in real code
    state.writer.WriteLine("=== Closing {0} ===",state.filename)

let createFile index =
    let newFilename = makeFilename index
    let newWriter = System.Console.Out // dummy
    newWriter.WriteLine("=== Creating {0} ===",newFilename)
    // create new state with new writer
    {fileIndex=index + 1; writer = newWriter; filename=newFilename }

let writeLine (state:State) line =
    if isNextObject line then
        // finish old file here
        closeFile state
        // create new file here and return updated state
        createFile state.fileIndex
    else
        // write the line to the current file
        state.writer.WriteLine(line)
        // return the unchanged state
        state

let processLines (lines: string seq) =
    // setup
    let initialState = createFile 1
    // process the file
    let finalState = lines |> Seq.fold writeLine initialState
    // tidy up
    closeFile finalState
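(Obviously the real version would use files rather than the console.)
Yes, it is crude, but it is easy to reason about, with no unpleasant surprises.
Here is a test:
processLines [
    "a"; "b"
    "---";"c"; "d"
    "---";"e"; "f"
]
And here is the output: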
=== Creating File1 ===
a
b
=== Closing File1 ===
=== Creating File2 ===
c
d
=== Closing File2 ===
=== Creating File3 ===
e
f
=== Closing File3 ===
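For reference, a file-backed version of the same fold might look roughly like the sketch below. It is an illustration under assumptions, not the code from the answer: splitToFiles, the File%i.txt naming, and the "---" marker are made up for the example, and the output directory is assumed to exist already.

```fsharp
open System.IO

// Hypothetical delimiter check, mirroring the test data above.
let isNextObject (line: string) = line = "---"

/// Writes `lines` to File1.txt, File2.txt, ... under `outputDir`,
/// switching to a new file whenever `isNextObject` holds.
/// Returns the number of files written.
let splitToFiles (outputDir: string) (lines: seq<string>) =
    let openWriter index =
        File.CreateText(Path.Combine(outputDir, sprintf "File%i.txt" index))
    // Fold state: index of the current file and its open writer.
    let step (index: int, writer: StreamWriter) (line: string) =
        if isNextObject line then
            // Finish the current file and open the next one.
            writer.Dispose()
            (index + 1, openWriter (index + 1))
        else
            writer.WriteLine line
            (index, writer)
    let (lastIndex, lastWriter) = lines |> Seq.fold step (1, openWriter 1)
    lastWriter.Dispose()
    lastIndex
```

Usage would be along the lines of File.ReadLines inputFile |> splitToFiles outputDir. Two caveats: like the console version, this drops the marker line itself, whereas groupWhen keeps it as the first line of each group, so write the line after opening the new file if it belongs to the content. The sketch also does not close the current writer if an exception is thrown midway; real code would want a try/finally around the fold.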