Idiomatic lazy sorting by multiple criteria

Question

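In Ruby, the most common way to sort by multiple criteria is to use sort_by with a block that returns an array containing the value for each sort criterion, in decreasing order of importance, e.g.: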

Dir["*"].sort_by { |f| [test(?s, f) || 0, test(?M, f), f] }

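will sort the directory entries by size, then by mtime, and finally by filename. This is efficient because it uses a Schwartzian transform to compute each file's size and mtime only once, rather than once per comparison. However, it isn't truly lazy: it computes the mtime of every file, even though if (say) every file in the directory has a different size, there is no need to compute any mtimes at all.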

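In this case that's not a big problem, because looking up the mtime immediately after looking up the size should be cheap thanks to kernel-level caching (e.g. on Linux, IIRC, both come from the stat(2) system call), and I wouldn't be surprised if Ruby had its own optimizations too. But imagine that the second criterion were not the mtime but (say) the number of occurrences of some string within the file, and that the files in question are huge. In that case you would really want lazy evaluation, to avoid reading those huge files in their entirety when sorting by size alone is enough.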
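To illustrate the "once per element" property mentioned above, here is a minimal sketch that counts how often the sort key is evaluated under sort_by versus a plain sort block. The method names and the word list are made up for the illustration; the key stands in for an expensive lookup such as a file's size.

```ruby
# Returns [sorted_list, number_of_key_evaluations] using sort_by.
def sort_with_sort_by(words)
  calls = 0
  sorted = words.sort_by { |w| calls += 1; [w.size, w] }
  [sorted, calls]
end

# Same ordering with a plain sort block: the key is recomputed
# for both operands of every comparison.
def sort_with_sort(words)
  calls = 0
  key = ->(w) { calls += 1; [w.size, w] }
  sorted = words.sort { |a, b| key.call(a) <=> key.call(b) }
  [sorted, calls]
end

words = %w[pear fig banana kiwi apple plum]
p sort_with_sort_by(words)  # [["fig", "kiwi", "pear", "plum", "apple", "banana"], 6]
p sort_with_sort(words)     # same order, but with more key evaluations
```

Note that sort_by still evaluates every component of the key array for every element, which is exactly the non-lazy behaviour described above.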

At the time of writing, the Wikibooks entry for Algorithm Implementation/Sorting/Schwartzian transform suggests this solution:

sorted_files =
  Dir["*"].                         # Get all files
    # compute tuples of name, size, modtime
    collect{|f| [f, test(?s, f), test(?M, f)]}.
    sort {|a, b|                    # sort
      a[1] <=> b[1] or              #   -- by increasing size
      b[2] <=> a[2] or              #   -- by age descending
      a[0] <=> b[0]                 #   -- by name
    }.collect{|a| a[0]}             # extract original name

This approach was copied from Perl, where

sort {
       $a->[1] <=> $b->[1] # sort first numerically by size (smallest first)
    or $b->[2] <=> $a->[2] # then numerically descending by modtime age (oldest first)
    or $a->[0] cmp $b->[0] # then stringwise by original name
  }

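works nicely, because Perl has a quirk where 0 or $foo evaluates to $foo. But in Ruby it is broken, because 0 or foo evaluates to 0. So in effect the Wikibooks implementation completely ignores mtimes and filenames, and sorts by size only. I've dusted off my Wikibooks account so that I can fix this, but I'm wondering: what is the cleanest way in Ruby to combine the results of multiple <=> (spaceship operator) comparisons?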
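The difference is easy to check in isolation. The following sketch assumes nothing beyond core Ruby; broken_compare is a hypothetical name for a comparator written in the style of the quoted Wikibooks code, applied to two-element arrays.

```ruby
# In Ruby only nil and false are falsy, so 0 never falls through an `or`.
zero_case = (0 or :next_criterion)
nil_case  = (nil or :next_criterion)
p zero_case  # 0
p nil_case   # :next_criterion

# Comparator chained with `or`, as in the quoted code.
def broken_compare(a, b)
  a[0] <=> b[0] or a[1] <=> b[1]
end

# The first elements tie, so the result is 0 and the second
# criterion is never consulted.
p broken_compare([1, "a"], [1, "b"])  # 0
```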

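I'll give a concrete example to clarify the question. Suppose there are two kinds of evaluation that might be needed as criteria during the sort. The first is relatively cheap: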

def size(a)
    # get the size of file `a`, and if we're feeling keen,
    # memoize the results
    ...
end

The second is expensive:

def matches(a)
    # count the number of occurrences of a string
    # in file `a`, which could be a large file, and
    # memoize the results
    ...
end

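We want to sort first by size ascending, then by number of matches descending. We can't use a Schwartzian transform, because that would non-lazily call matches() on every item.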

We could define a helper like this

def nil_if_equal(result)
  result == 0 ? nil : result
end

and then do:

sort {|a, b|
  nil_if_equal(size(a) <=> size(b)) or
  matches(b) <=> matches(a)
}

If there are n sort criteria, you need n-1 calls to nil_if_equal, since only the last criterion doesn't need it.
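For reference, here is a minimal runnable sketch of this approach. The FILES table, the MATCH_CALLS log, the MATCH_CACHE hash and the lazy_sort wrapper are hypothetical in-memory stand-ins introduced for the illustration (no real files are read), and "matches" simply counts the letter x.

```ruby
# Hypothetical stand-ins for files: name => [size, contents].
FILES = {
  "a.log" => [10, "x x x"],
  "b.log" => [20, "x"],
  "c.log" => [20, "x x"],
  "d.log" => [30, ""],
}

MATCH_CALLS = []  # records which "files" actually had their contents scanned
MATCH_CACHE = {}  # memoized match counts

def size(name)
  FILES.fetch(name)[0]
end

def matches(name)
  MATCH_CACHE.fetch(name) do
    MATCH_CALLS << name
    MATCH_CACHE[name] = FILES.fetch(name)[1].count("x")
  end
end

def nil_if_equal(result)
  result == 0 ? nil : result
end

# Size ascending, then number of matches descending.
def lazy_sort(names)
  names.sort do |a, b|
    nil_if_equal(size(a) <=> size(b)) or
      matches(b) <=> matches(a)
  end
end

p lazy_sort(FILES.keys)  # ["a.log", "c.log", "b.log", "d.log"]
p MATCH_CALLS.sort       # the names whose contents were scanned
```

Only b.log and c.log tie on size, so those should be the only entries whose contents get scanned.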

So is there a more idiomatic way than this, one that avoids the need for nil_if_equal?

Answer 1

Not sure how idiomatic it is, but here is a way to use sort_by again. Instead of, for example,

['bab', 'foo', 'so', 'bar'].sort_by { |s| [s.size, count_a(s), count_b(s)] }

do this to make count_a(s) and count_b(s) lazy and memoized:

['bab', 'foo', 'so', 'bar'].sort_by { |s| [s.size, lazy{count_a(s)}, lazy{count_b(s)}] }

My lazy makes the block behave like a lazy, memoized version of the value it produces.

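Demo output, showing that we only count what is necessary (i.e., nothing is counted in 'so' because it has a unique size, and 'b' is not counted in 'foo' because its 'a'-count is unique among the strings of size 3):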

Counting 'a' in 'bab'.
Counting 'a' in 'foo'.
Counting 'a' in 'bar'.
Counting 'b' in 'bab'.
Counting 'b' in 'bar'.
["so", "foo", "bar", "bab"]

Demo code:

def lazy(&block)
  def block.value
    (@value ||= [self.yield])[0]
  end
  def block.<=>(other)
    value <=> other.value
  end
  block
end

def count_a(s)
  puts "Counting 'a' in '#{s}'."
  s.count('a')
end

def count_b(s)
  puts "Counting 'b' in '#{s}'."
  s.count('b')
end

p ['bab', 'foo', 'so', 'bar'].sort_by { |s| [s.size, lazy{count_a(s)}, lazy{count_b(s)}] }

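Another way to implement the memoization of value: when it is first called, it immediately replaces itself with a method that just returns the stored value: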

  def block.value
    def self.value; @value end
    @value = self.yield
  end
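
The same idea can also be sketched with a small class instead of singleton methods on the block. This is only an illustration under stated assumptions: LazyKey, lazy_key, count_letter, COUNT_LOG and lazy_sort_words are hypothetical names that are not part of the code above, and defined?(@value) is used here so that nil or false results are cached too.

```ruby
# LazyKey is a hypothetical name; it is not part of the answer's code.
class LazyKey
  include Comparable

  def initialize(&block)
    @block = block
  end

  # Run the block on first use and cache the result,
  # even if the result is nil or false.
  def value
    @value = @block.call unless defined?(@value)
    @value
  end

  # Comparing two keys forces both values.
  def <=>(other)
    value <=> other.value
  end
end

def lazy_key(&block)
  LazyKey.new(&block)
end

COUNT_LOG = []  # records each [letter, string] pair that is actually counted

def count_letter(s, letter)
  COUNT_LOG << [letter, s]
  s.count(letter)
end

def lazy_sort_words(words)
  words.sort_by do |s|
    [s.size, lazy_key { count_letter(s, 'a') }, lazy_key { count_letter(s, 'b') }]
  end
end

p lazy_sort_words(['bab', 'foo', 'so', 'bar'])  # ["so", "foo", "bar", "bab"]
p COUNT_LOG.sort                                # the counts that were actually performed
```

Array#<=> compares element by element and stops at the first difference, so a LazyKey is only forced when all earlier criteria tie.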

ruby

sorting

performance

memoization

lazy-evaluation