Trying to understand the sort utility in Linux

Question

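I have a file named a.csv. It contains

After running this command: sort -k1 -d -t "," a.csv

the result is

This is unexpected, because 10001 should come before 100010.

I tried to understand why this happens, but could not find an answer.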

$ sort --version
sort (GNU coreutils) 8.13
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

Answer 1

It sorts alphabetically rather than numerically, so "," comes before "0"; in other words, it behaves more like dictionary order.

Answer 2

The -d option stands for --dictionary-order:

-d, --dictionary-order consider only blanks and alphanumeric characters

But I think you want to use -n (--numeric-sort) instead:

-n, --numeric-sort compare according to string numerical value

So change your command to the following:

sort -k1 -n -t "," a.csv

http://man7.org/linux/man-pages/man1/sort.1.html
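
To illustrate the numeric sort, here is a minimal sketch. The rows are the sample values that appear elsewhere in this thread, fed in an arbitrary order chosen for illustration. Pinning the locale with LC_ALL=C and limiting the key to the first field with -k1,1 are assumptions added here, not part of the command above.

```shell
# Sample rows, in an arbitrary input order chosen for illustration.
rows='100008,3
10000,3
100010,5
100010,4
10001,6
100021,7'

# -t, splits on commas; -k1,1n compares only the first field, numerically.
# LC_ALL=C is an assumption made here so the result does not depend on locale.
numeric_sorted=$(printf '%s\n' "$rows" | LC_ALL=C sort -t, -k1,1n)
printf '%s\n' "$numeric_sorted"
```

With a numeric comparison, 10001 sorts before 100008 and 100010, which is the ordering the question expected.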

Answer 3

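Some of the other replies assume this is a matter of numeric versus dictionary sorting. It is not, because the output given in the question is incorrect even for an alphabetical sort.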
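
As a quick check of that claim, here is a minimal sketch that sorts two of the sample rows with no -k and no -d option. It assumes GNU sort and forces the C locale with LC_ALL=C (an assumption added here, so that the comparison is plain byte order); the variable name is only for illustration.

```shell
# Plain byte-order sort of two rows, with no -k and no -d.
# In ASCII the comma (0x2C) is smaller than the digit 0 (0x30),
# so "10001,6" compares lower than "100010,4".
plain_order=$(printf '100010,4\n10001,6\n' | LC_ALL=C sort)
printf '%s\n' "$plain_order"
```

Under these assumptions 10001,6 comes out first, so a purely alphabetical comparison would not produce the ordering the question complains about.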

The answer

To get the correct sort, you need to change -k1 to -k1,1:

$ sort -k1,1 -d -t "," a.csv
10000,3
100008,3
10001,6
100010,4
100010,5
100021,7

The reason

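The -k option takes two numbers, the start and end fields to sort on (i.e. -ks,e, where s is the start and e is the end). By default, the end field is the end of the line. Hence -k1 is equivalent to not giving the -k option at all. To demonstrate this, compare: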

$ printf "1,a,1\n2,aa,2\n" | sort -k2 -t,
1,a,1
2,aa,2

with:

$ printf "1~a~1\n2~aa~2\n" | sort -k2 -t~
2~aa~2
1~a~1

The first sorts a,1 before aa,2, whereas the second sorts aa~2 before a~1, because in ASCII , < a < ~.

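So, to get the desired behaviour, we need to sort on exactly one field. In your case that means using 1 as both the start and the end field, so you specify -k1,1. If you try the two examples above with -k2,2 instead of -k2, you will find that you get the same (correct) ordering in both cases.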
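
The claim in the previous paragraph can be checked with a short sketch. It assumes GNU sort and pins the locale with LC_ALL=C (an assumption added here so that comparisons are plain byte order); the variable names are only for illustration.

```shell
# Same two inputs as above, but with the key limited to field 2 (-k2,2).
comma_sorted=$(printf '1,a,1\n2,aa,2\n' | LC_ALL=C sort -k2,2 -t,)
tilde_sorted=$(printf '1~a~1\n2~aa~2\n' | LC_ALL=C sort -k2,2 -t'~')
printf '%s\n' "$comma_sorted" "$tilde_sorted"

# The question's case on two rows.
# -k1 keys on the whole line; -d then skips the comma, so the comparison
# is effectively "100016" versus "1000104".
whole_line=$(printf '10001,6\n100010,4\n' | LC_ALL=C sort -k1 -d -t,)
# -k1,1 keys on the first field only: "10001" versus "100010".
first_field=$(printf '100010,4\n10001,6\n' | LC_ALL=C sort -k1,1 -d -t,)
printf '%s\n' "$whole_line" "$first_field"
```

With -k2,2 the separator no longer takes part in the comparison, so both inputs sort the "a" row first. In the second pair, only the -k1,1 form puts 10001,6 ahead of 100010,4.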

Many thanks to Eric and Assaf from the coreutils mailing list for pointing this out.

Answer 4

The sort is alphabetical, not numeric. Replace -d with -n in the option list to sort numerically.

Answer 5

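You have not found a bug in sort. Your usage error is that you used "-k1" ("set the key to the first field through the end of the line") instead of "-k1,1" ("set the key to use only the first field"). If you are using GNU sort, the --debug option will show the difference. Whenever a key extends beyond a single field, the separators are included in the key.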
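
A minimal sketch of that --debug comparison follows. It assumes GNU sort (the option is not available in every sort implementation) and LC_ALL=C; the input is a single row taken from the sample output in this thread, and the variable names are only for illustration. GNU sort marks the extent of each key with underscores beneath the line and also writes diagnostics to standard error, which are discarded here.

```shell
# -k1: the key runs from field 1 to the end of the line.
debug_whole=$(printf '10001,6\n' | LC_ALL=C sort --debug -t, -k1 2>/dev/null)
printf '%s\n' "$debug_whole"

# -k1,1: the key is field 1 only.
debug_field=$(printf '10001,6\n' | LC_ALL=C sort --debug -t, -k1,1 2>/dev/null)
printf '%s\n' "$debug_field"
```

In the first run the key underline should span the whole line, comma included; in the second it should stop after 10001.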
