Ruby alphanumeric sort not working as expected

Given the following array:

y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8]
=> ["A1", "A2", "B5", "B12", "A6", "A8", "B10", "B3", "B4", "B8"]

The expected sorted array is:

=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]

Using the following (naive) sort, I get:

irb(main):2557:0> y.sort{|a,b| puts "%s <=> %s = %s\n" % [a, b, a <=> b]; a <=> b}
A1 <=> A8 = -1
A8 <=> B8 = -1
A2 <=> A8 = -1
B5 <=> A8 = 1
B4 <=> A8 = 1
B3 <=> A8 = 1
B10 <=> A8 = 1
B12 <=> A8 = 1
A6 <=> A8 = -1
A1 <=> A2 = -1
A2 <=> A6 = -1
B12 <=> B3 = -1
B3 <=> B8 = -1
B5 <=> B3 = 1
B4 <=> B3 = 1
B10 <=> B3 = -1  # this appears to be wrong, looks like 1 is being compared, not 10.
B12 <=> B10 = 1
B5 <=> B4 = 1
B4 <=> B8 = -1
B5 <=> B8 = -1
=> ["A1", "A2", "A6", "A8", "B10", "B12", "B3", "B4", "B5", "B8"]

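...which is obviously not what I want. I know I could split off the alpha part first and then sort on the numbers, but it seems like I shouldn't have to do that.

Possibly important caveat: we are stuck on Ruby 1.8.7 for now :( But even Ruby 2.0.0 does the same thing. What am I missing here?

Suggestions?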

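You need a natural sort (the order a human would expect), not the standard sort based on character values. Gems such as these would be a starting point: https://github.com/dogweather/naturally, https://github.com/johnnyshields/naturalsort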

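Humans read a string like "A2" as the letter "A" followed by the number 2, and they sort the letter parts as strings and the number parts as numbers. The standard sort() treats a string as a sequence of characters and orders it by character value, regardless of what the characters are. So to sort(), "A10" and "A2" look like [ 'A', '1', '0' ] and [ 'A', '2' ]. Because '1' sorts before '2', and the characters that follow cannot change that order, "A10" sorts before "A2". To a human, the same strings look like [ "A", 10 ] and [ "A", 2 ], and 10 sorts after 2, so we get the opposite result. You can manipulate the strings so that the character-value sort() produces the expected result: make the numeric part fixed-width and left-pad it with zeros (rather than spaces, to avoid embedded whitespace), turning "A2" into "A02", which the standard sort() places before "A10".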
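A small sketch of the point above, assuming a modern Ruby (1.9 or later, where String#chars returns an array). The variable names are made up for illustration:

```ruby
# What the standard string comparison effectively walks through.
a10_chars = "A10".chars   # ["A", "1", "0"]
a2_chars  = "A2".chars    # ["A", "2"]

# The comparison is decided at the second character: "1" sorts before "2".
plain = "A10" <=> "A2"

# With a fixed-width, zero-padded number the character order matches the
# numeric order.
padded = "A02" <=> "A10"

puts a10_chars.inspect
puts a2_chars.inspect
puts plain    # -1, so "A10" sorts before "A2"
puts padded   # -1, so "A02" sorts before "A10"
```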

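You are sorting strings. Strings sort like strings, not like numbers. If you want to sort like numbers, then you should sort numbers, not strings. The string 'B10' is lexicographically smaller than the string 'B3'. That is not unique to Ruby, or even to programming; it is how text is sorted lexicographically pretty much everywhere: in programming, databases, lexicons, dictionaries, phone books, and so on.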

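You should split your strings into their numeric and non-numeric parts and convert the numeric parts to numbers. Array sorting is lexicographic, so this ends up sorting exactly right: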

y.sort_by {|s| # use `sort_by` for a keyed sort, not `sort`
  s.
    split(/(\d+)/). # split numeric parts from non-numeric
    map {|s| # the below parses numeric parts as decimals, ignores the rest
      begin Integer(s, 10); rescue ArgumentError; s end }}
#=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]
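To make the sort keys visible, here is a minimal sketch that wraps the same split-and-convert idea in a helper. The name natural_key is hypothetical (it is not part of Ruby or of any gem), and the regex test stands in for the begin/rescue above. Because split with a capturing group always starts with a non-digit part (possibly an empty string), strings and integers land in the same positions in every key, so the arrays stay comparable:

```ruby
# Hypothetical helper: turns a string into an array of alternating
# non-numeric strings and integers.
def natural_key(str)
  str.split(/(\d+)/).map { |part| part =~ /\A\d+\z/ ? part.to_i : part }
end

key_b10   = natural_key("B10")    # ["B", 10]
key_10b   = natural_key("10B")    # ["", 10, "B"]  (leading empty string)
key_multi = natural_key("x1y20")  # ["x", 1, "y", 20]

sample = %w[x10 x9 10B 9A x1y20 x1y3]
sorted = sample.sort_by { |s| natural_key(s) }
puts sorted.inspect
```

The same helper can be passed to sort_by for the array in the question: y.sort_by { |s| natural_key(s) }.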

If you know the maximum number of digits in your numbers, you can also left-pad the numbers with zeros during the comparison:

y.sort_by { |string| string.gsub(/\d+/) { |digits| format('%02d', digits.to_i) } }
#=> ["A1", "A2", "A6", "A8", "B3", "B4", "B5", "B8", "B10", "B12"]

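Here '%02d' breaks down as follows: % starts the format specification, 0 pads the number with leading zeros, 2 is the minimum total width of the number, and d outputs the value as a decimal (base 10) integer. More information is in the Ruby documentation for Kernel#format.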

This means 'A1' is converted to 'A01', 'B8' becomes 'B08', and 'B12' stays 'B12' because it already has 2 digits. The padded strings are used only during the comparison.
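If the maximum number of digits is not known in advance, one option is to compute it from the data first. This is only a sketch: the extra element "B100" and the variable names are assumptions added for illustration, and String#rjust is swapped in for format to do the padding:

```ruby
# "B100" is added here to show a width other than 2.
y = %w[A1 A2 B5 B12 A6 A8 B10 B3 B4 B8 B100]

# Length of the longest run of digits anywhere in the array.
width = y.flat_map { |s| s.scan(/\d+/) }.map(&:size).max

# Pad every run of digits to that width, for the comparison only.
sorted = y.sort_by { |s| s.gsub(/\d+/) { |digits| digits.rjust(width, "0") } }

puts width
puts sorted.inspect
```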

Here are a couple of ways to do this.

arr = ["A1", "A2", "B5", "B12", "A6", "AB12", "A8", "B10", "B3", "B4",
       "B8", "AB2"]

Sort by 2-element arrays

arr.sort_by { |s| [s[/\D+/], s[/\d+/].to_i] }
  #=> ["A1", "A2", "A6", "A8", "AB2", "AB12", "B3", "B4", "B5", "B8",
  #    "B10", "B12"] 

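This is similar to @Jorg's solution, except that I compute the two elements of the comparison array separately, rather than splitting the string into two parts and converting the latter to an integer.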

Enumerable#sort_by compares the values computed for each pair of elements of arr with the spaceship method, <=>. As the values being compared are arrays, the method Array#<=> is used. See especially the third paragraph of that doc.
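A few direct calls illustrate how Array#<=> behaves. The variable names are only for illustration:

```ruby
# Elements are compared pairwise, left to right; the first difference decides.
same_letters = ["A", 10] <=> ["A", 2]    # letters tie, then 10 > 2
diff_letters = ["A", 12] <=> ["AB", 2]   # "A" < "AB" decides, numbers ignored

# If one array is a prefix of the other, the shorter one sorts first.
prefix = ["A"] <=> ["A", 1]

# An Integer and a String at the same position are not comparable.
mixed = ["A", 1] <=> ["A", "1"]

puts same_letters    # 1
puts diff_letters    # -1
puts prefix          # -1
puts mixed.inspect   # nil
```

The last case is why both elements of each key need a consistent type: a string in the first position and an integer in the second.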

sort_by compares the following 2-element arrays:

arr.each { |s| puts "%s-> [%s, %d]" %
  ["\"#{s}\"".ljust(7), "\"#{s[/\D+/]}\"".ljust(4), s[/\d+/].to_i] }

"A1"   -> ["A" , 1]
"A2"   -> ["A" , 2]
"B5"   -> ["B" , 5]
"B12"  -> ["B" , 12]
"A6"   -> ["A" , 6]
"AB12" -> ["AB", 12]
"A8"   -> ["A" , 8]
"B10"  -> ["B" , 10]
"B3"   -> ["B" , 3]
"B4"   -> ["B" , 4]
"B8"   -> ["B" , 8]
"AB2"  -> ["AB", 2]

Insert spaces between the alpha and numeric parts of the string

max_len = arr.max_by(&:size).size
  #=> 4
arr.sort_by { |s| "%s%s%d" % [s[/\D+/], " "*(max_len-s.size), s[/\d+/].to_i] }
  #=> ["A1", "A2", "A6", "A8", "AB2", "AB12", "B3", "B4", "B5", "B8",
  #    "B10", "B12"]

Here sort_by compares the following strings:

arr.each { |s| puts "%s-> \"%s\"" %
  ["\"#{s}\"".ljust(7), s[/\D+/] + " "*(max_len-s.size) + s[/\d+/]] }

"A1"   -> "A  1"
"A2"   -> "A  2"
"B5"   -> "B  5"
"B12"  -> "B 12"
"A6"   -> "A  6"
"AB12" -> "AB12"
"A8"   -> "A  8"
"B10"  -> "B 10"
"B3"   -> "B  3"
"B4"   -> "B  4"
"B8"   -> "B  8"
"AB2"  -> "AB 2"