函数生成的长度不正确的序列

Question

当 repl 变量设置为 false 时，为什么以下函数返回的序列长度不正确？

open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =

    let n = data |> Seq.length

    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        match m > 0 with
        | true ->
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m 
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        | false -> 
            idx

    let sample =
        match repl with
        | true ->
            DiscreteUniform.Samples(0, n-1) 
            |> Seq.take size 
            |> Seq.map (fun index -> Seq.item index data)
        | false ->
            generateIndex (seq []) 
            |> Seq.map (fun index -> Seq.item index data)

    sample

运行函数...

let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length 
printfn "requested -> %A\nreturned -> %A" requested resultlen

结果长度错误。

> 
requested -> 1000
returned -> 998

> 
requested -> 1000
returned -> 1001

> 
requested -> 1000
returned -> 997

知道我犯了什么错误吗？

Answer 1

首先，我想对编码风格发表评论。然后我会解释为什么你的序列返回的长度不同。

在评论中，我提到用简单的 if ... then ... else 表达式替换 match (bool) with true -> ... | false -> ...，但我认为您正在使用的另一种编码风格可以改进。您写道：

let sample (various_parameters) =  // This is a function
    // Other code ...
    let sample = some_calculation  // This is a variable
    sample  // Return the variable

虽然 F# 允许您重用这样的名称，并且函数内部的名称将 "shadow" 函数外部的名称，但重用的名称具有 通常不是一个好主意与原始名称完全不同的类型。换句话说，这可能是个好主意：

let f (a : float option) =
    let a = match a with
            | None -> 0.0
            | Some value -> value
    // Now proceed, knowing that `a` has a real value even if had been None before

或者，因为以上正是 F# 为您提供的 defaultArg 用于：

let f (a : float option) =
    let a = defaultArg a 0.0
    // This does exactly the same thing as the previous snippet

在这里，我们在函数中使名称 a 引用与名为 a 的参数不同的类型：参数是 float option，而 a 在我们的函数里面是一个 float。但它们属于 "same" 类型——也就是说，"The caller may have specified a floating-point value or they may not" 和 "Now I definitely have a floating-point value" 之间的心理差异很小。但是 "The name sample is a function that takes three parameters" 和 "The name sample is a sequence of floats" 之间存在非常大的心理差距。我强烈建议使用 result 之类的名称作为函数要 return 的值，而不是 re-using 函数名称。

此外，这似乎不必要地冗长：

let result =
    match repl with
    | true ->
        DiscreteUniform.Samples(0, n-1) 
        |> Seq.take size 
        |> Seq.map (fun index -> Seq.item index data)
    | false ->
        generateIndex (seq []) 
        |> Seq.map (fun index -> Seq.item index data)

result

每当我发现自己在函数末尾写 "let result = (something) ; result" 时，我通常只想用 (something) 替换整个代码块。也就是说，上面的代码片段可能会变成：

match repl with
| true ->
    DiscreteUniform.Samples(0, n-1) 
    |> Seq.take size 
    |> Seq.map (fun index -> Seq.item index data)
| false ->
    generateIndex (seq []) 
    |> Seq.map (fun index -> Seq.item index data)

这又可以用 if...then...else 表达式替换：

if repl then
    DiscreteUniform.Samples(0, n-1) 
    |> Seq.take size 
    |> Seq.map (fun index -> Seq.item index data)
else
    generateIndex (seq []) 
    |> Seq.map (fun index -> Seq.item index data)

这是您代码中的最后一个表达式。换句话说，我可能会按如下方式重写您的函数（仅更改样式，不更改逻辑）：

open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =

    let n = data |> Seq.length

    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        if m > 0 then
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m 
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        else
            idx

    if repl then
        DiscreteUniform.Samples(0, n-1) 
        |> Seq.take size 
        |> Seq.map (fun index -> Seq.item index data)
    else
        generateIndex (seq []) 
        |> Seq.map (fun index -> Seq.item index data)

如果我能弄清楚为什么你的序列长度错误，我也会用这些信息更新这个答案。

更新： 好的，我想我明白了您的 generateIndex 函数中发生了什么，这给您带来了意想不到的结果。有两件事让你绊倒：一是序列惰性，二是运行domness.

我将你的 generateIndex 函数复制到 VS Code 中并添加了一些 printfn 语句来查看发生了什么。先是代码我运行，然后是结果：

let rec generateIndex n size idx =
    let m = size - Seq.length(idx)
    printfn "m = %d" m
    match m > 0 with
    | true ->
        let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
        printfn "Generating newIdx as %A" (List.ofSeq newIdx)
        let idx = (Seq.append idx newIdx) |> Seq.distinct
        printfn "Now idx is %A" (List.ofSeq idx)
        generateIndex n size idx
    | false -> 
        printfn "Done, returning %A" (List.ofSeq idx)
        idx

所有这些 List.ofSeq idx 调用都是为了让 F# Interactive 在我打印出来时打印 seq 的四个以上项目（默认情况下，如果您尝试使用 %A 打印 seq，它只会打印出四个值，然后如果 seq 中有更多可用值，则打印一个省略号）。此外，我将 n 和 size 变成参数（我不会在调用之间更改）以便我可以轻松地对其进行测试。然后我将其称为 generateIndex 100 5 (seq []) 并得到以下结果：

m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]

看到数字如何不断变化了吗？这是我的第一个线索，表明有事情发生了。看，seq 是 lazy。除非必须，否则他们不会评估其内容。您不应将 seq 视为数字列表。相反，将其视为一个 生成器 ，当被要求输入数字时，它将根据某种规则生成它们。在您的情况下，规则是 "Choose random integers between 0 and n-1, then take m of those numbers"。关于 seqs 的另一件事是它们不缓存它们的内容（尽管有一个可用的 Seq.cache 函数可以缓存它们的内容）。因此，如果你有一个基于运行dom 数字生成器的 seq，它的结果每次都会不同，正如你在我的输出中看到的那样。当我打印出 newIdx 时，它打印出 [74; 76; 97； 78; 31]，但是当我将它附加到一个空序列时，结果打印为 [68; 28; 65； 58; 82].

为什么会有这种差异？因为 Seq.append 不会强制求值 。它只是创建一个新的 seq 其规则是 "take all items from the first seq, then when that one exhausts, take all items from the second seq. And when that one exhausts, end." 并且 Seq.distinct 也不会强制评估；它只是创建一个新的 seq 其规则是 "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." 所以你在调用 generateIdx 之间传递的是一个对象， 评估时 ，将选择一组 0 到 n-1 之间的运行dom 数字（在我的简单情况下，0 到 100 之间），然后将该集合缩减为一组不同的数字。

现在，事情是这样的。每次计算 seq 时，它都会从头开始：首先调用 DiscreteUniform.Samples(0, n-1) 生成无限的运行dom 数字流，然后从该流中选择 m 数字，然后丢弃任何重复项。（我暂时忽略 Seq.append，因为它会造成不必要的心理复杂性，而且它实际上并不是错误的一部分）。现在，在函数的每个 go-round 的开头，您检查序列的长度，这确实会导致对其进行评估。这意味着它选择（在我的示例代码的情况下）5 运行dom 数字在 0 和 99 之间，然后确保它们都是不同的。如果它们都是不同的，那么 m = 0 函数将退出，returning...不是数字列表，bt seq 对象。当评估 seq 对象时，它将从头开始，选择一组不同 5 运行dom 数字，然后丢弃任何重复项。因此，仍然有可能这组 5 个数字中至少有一个最终会重复，因为其长度被测试的序列（我们知道不包含重复项，否则 m 会大于 0 ) 不是 returned 的序列。 returned 的序列有 1.0 * 0.99 * 0.98 * 0.97 * 0.96 的机会不包含任何重复项，大约为 0.9035。所以有 just-under-10% 的机会，即使你检查了 Seq.length 并且它是 5，returned seq 的长度最终还是 4 --因为它选择了一组不同的一组运行dom 号码，而不是你检查的号码。

为了证明这一点，我再次运行该函数，这次只选择 4 个数字，以便在 F# Interactive 提示符下完整显示结果。我的 generateIndex 100 4 (seq []) 的运行产生了以下输出：

m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]

注意当我打印 "Done, returning (value of idx)" 时它只有 3 个值吗？即使它最终 returned 4 个值（因为它为实际结果选择了不同的运行dom 数字选择，并且该选择没有重复），这证明了问题。

顺便说一下，您的函数有一个 other 问题，那就是它比需要的速度慢得多。在某些情况下，函数 Seq.item 必须运行从头开始 遍历序列 以便选择序列的第 n 项.最好在函数开始时将数据存储在数组中 (let arrData = data |> Array.ofSeq)，然后替换

        |> Seq.map (fun index -> Seq.item index data)

和

        |> Seq.map (fun index -> arrData.[index])

数组查找是在恒定时间内完成的，因此您的示例函数从 O(N^2) 降低到 O(N)。

TL;DR：在之前使用Seq.distinctm 它的值和错误将消失。您可以用一个简单的 DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size 替换整个 generateIdx 函数。（并使用数组进行数据查找，以便您的函数运行更快）。换句话说，这是 ~~final~~ almost-final 版本，我将如何重写您的代码：

let sample (data : seq<float>) (size : int) (repl : bool) = let arrData = data |> Array.ofSeq let n = arrData |> Array.length if repl then DiscreteUniform.Samples(0, n-1) |> Seq.take size |> Seq.map (fun index -> arrData.[index]) else DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size |> Seq.map (fun index -> arrData.[index])

就是这样！简单易懂，而且（据我所知）bug-free.

编辑： ...但不是完全干的，因为在那个 "final" 版本中仍然有一些重复的代码。（感谢 CaringDev 在下面的评论中指出）。 Seq.take size |> Seq.map 在 if 表达式的两个运行分支中重复出现，因此有一种方法可以简化该表达式。我们可以这样做：

let randomIndices = if repl then DiscreteUniform.Samples(0, n-1) else DiscreteUniform.Samples(0, n-1) |> Seq.distinct randomIndices |> Seq.take size |> Seq.map (fun index -> arrData.[index])

所以这是我的建议的 truly-final 版本：

let sample (data : seq<float>) (size : int) (repl : bool) = let arrData = data |> Array.ofSeq let n = arrData |> Array.length let randomIndices = if repl then DiscreteUniform.Samples(0, n-1) else DiscreteUniform.Samples(0, n-1) |> Seq.distinct randomIndices |> Seq.take size |> Seq.map (fun index -> arrData.[index])

函数生成的长度不正确的序列

Sequence of incorrect length generated by function

f#

lazy-evaluation

lazy-sequences