当未提供 --stable 选项时，sort -n 是否可以预测地处理关系？如果有，怎么办？

Question

这里看起来 space 在两行中的 3 之后打破了数字排序并让字母排序开始，因此 11<2 :

$ echo -e '3 2\n3 11' | sort -n
3 11
3 2

在man sort，我阅读了

       -s, --stable
              stabilize sort by disabling last-resort comparison

这意味着没有 -s 最后的比较是完成（关系之间，因为 -s 不影响非关系）。

所以问题是：这个最后的比较是如何完成的？如果需要回答这个问题，欢迎参考源代码。

This answer Unix 从实验中推断出关系的排序是字典序的。

standard/POSIX 对此有何评论？

Answer 1

Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in

sort -n 不是 sort -n -k1,1 -k2,2。 sort -n 将整行（不是字段！）解释为数字，如 atoi("3 11") 给出 3 .然后对这些数字进行排序。因为 sort_them(atoi("3 11"), atoi("3 2")) 是未排序的，因为都是数字 3，最后的比较排序开始了。

how is this last-resort comparison accomplished?

这个想法是将整行进行比较，就好像 strcmp 或类似的（即 strcoll）。因为 1 在 2 之前，所以 strcmp("3 11", "3 2") 将 3 11 排在第一位。没有考虑选项，-n没有考虑。

A reference to the source code would be welcome, if necessary to answer the question.

在 GNU 排序中实际上是 xmemcoll0 在 coreutils sort.c#L2653 in compare (struct line const *a, struct line const *b) 中考虑整理，并且当 LC_COLLATE 未设置时有 memcmp 作为后备。

我在 openbsd 中看到排序在 openbsd/sort/coll.c#L528 str_list_coll(struct bwstring *str1, struct sort_list_item **ss2) but also in list_coll_offset() 附近，如果所有键都比较相等，则调用 top_level_str_coll 对整行进行排序。

Does the standard/POSIX say anything about this?

如果“this”指的是稳定排序和最后手段比较，那么当然可以。让我们从 POSIX sort 重点复制我的整个段落：

Comparisons shall be based on one or more sort keys extracted from each line of input (or, if no sort keys are specified, the entire line up to, but not including, the terminating ), and shall be performed using the collating sequence of the current locale. If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

和

Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see XBD LC_COLLATE). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.

Answer 2

问题： 最后比较是怎么做的？

这在 GNU coreutils 的文档中很快得到了回答：

A pair of lines is compared as follows: sort compares each pair of fields (see --key), in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. If no key fields are specified, sort uses a default key of the entire line. Finally, as a last resort when all keys compare equal, sort compares entire lines as if no ordering options other than --reverse (-r) were specified. The --stable (-s) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. The --unique (-u) option also disables the last-resort comparison.

Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE

_{source: Sort Invocation GNU Coreutils}

这意味着最后会按照LC_COLLATE的排序顺序排序，即按字典顺序（大部分）

POSIX，另一方面添加了一个更严格的最终超最后手段选项。

If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

_{source: Sort POSIX standard}

我不确定这是否在 GNU 排序中实现，因为这不是必需的。尽管如此，POSIX 强烈推荐它（参见 Rationale last paragraph）

对于 OP 来说这意味着什么？

对键定义有一个令人不安的误解。假设你做了类似

$ sort --option -k1,3 file

通常理解 sort 将首先对字段 1 进行排序，然后使用 --option 对字段 2 进行排序，最后对字段 3 进行排序。这是不正确的。它将使用被定义为由字段 1 到 3 组成的子字符串的键。如果两行相等地整理，sort 将执行最后的选择（见前面）

aa  bb cc xxxxxxxx
---------           <<< rule1: according to the key
------------------  <<< rule2: lexicographical sort (last resort)

使用 GNU 排序，您可以看到排序使用了哪个子字符串。这是通过 --debug 选项完成的。在这里您可以看到 3 个简单案例之间的区别：

# Sort lexicographically with full line
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort --debug
sort: using ?en_GB.UTF-8? sorting rules
ab c d
______
efg h i
_______
# -------------------------------------------------------------------
# Sort lexicographically with the substring formed by field 1 and 2
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort -k1,2 --debug
sort: using ?en_GB.UTF-8? sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
ab c d
____
______
efg h i
_____
_______
# -------------------------------------------------------------------
# Sort lexicographically with field 1 followed by field 2
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort -k1,1 -k2,2 --debug
sort: using ?en_GB.UTF-8? sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
sort: leading blanks are significant in key 2; consider also specifying 'b'
ab c d
__
  __
______
efg h i
___
   __
_______

当您进行数字排序时（使用 -n 或 -g），sort 将尝试从 key 中提取数字（1234abc 导致 1234）并使用该数字进行排序。

# Sort numerically with full line
# -------------------------------------------------------------------
$ echo -e "3a 11a\n3b 2b" | sort -n --debug
sort: using ?en_GB.UTF-8? sorting rules
3a 11a
_         # numeric on full line
______    # lexicographically on full line  (last resort)
3b 2b
_         # numeric on full line
_____     # lexicographically on full line  (last resort)
# -------------------------------------------------------------------
# Sort numerically with field 1 then field 2
# -------------------------------------------------------------------
$ echo -e "3a 11a\n3b 2b" | sort -n -k1,1 -k2,2 --debug
sort: using ?en_GB.UTF-8? sorting rules
3b 2b
_         # numeric on field 1
   _      # numeric on field 2
_____     # lexicographically on full line  (last resort)
3a 11a
_         # numeric on field 1
   __     # numeric on field 2
______    # lexicographically on full line  (last resort)

正如您在这两种情况下注意到的，即使第一个字段可以按字典顺序排序 3a < 3b，它也会被忽略，因为我们只从键中选择数字。

当未提供 --stable 选项时，sort -n 是否可以预测地处理关系？如果有，怎么办？

Does sort -n handle ties predictably when the --stable option is NOT provided? If it does, how?

sorting

bash

lexicographic

stable-sort

language-lawyer