order() in data.frame and data.table

If I use order() on a data.frame and on a data.table, I get different results. For example:

library(data.table)

A <- data.frame(one = c("k"), two = c("3_28", "31_60", "48_68"))
B <- as.data.table(A)

A[order(A$one,A$two),]
  one   two
1   k  3_28
2   k 31_60
3   k 48_68


B[order(B$one, B$two),]
   one   two
1:   k 31_60
2:   k  3_28
3:   k 48_68

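I have to admit this came as a bit of a shock, because for years I have assumed that order() gives the same results for a data.frame and a data.table. I guess I have a lot of code to check!

Is there a way to ensure that order() gives the same results for a data.frame and a data.table?

Apologies if this difference in behaviour is already well known and this is just an example of my ignorance.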

When used inside a data.table operation, order(..) uses data.table:::forder. From the Introduction to data.table vignette:

order() is internally optimised

  • We can use "-" on a character columns within the frame of a data.table to sort in decreasing order.

  • In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.

The key to the difference is that it uses a "fast radix order". If you look at base::order, however, it has a method= argument, documented as follows:

  method: the method to be used: partial matches are allowed.  The
          default ('"auto"') implies '"radix"' for short numeric
          vectors, integer vectors, logical vectors and factors.
          Otherwise, it implies '"shell"'.  For details of methods
          '"shell"', '"quick"', and '"radix"', see the help for 'sort'.

Since the second column of your data.table is not numeric, integer, logical, or factor, base::order falls back to the "shell" method, and the result differs.
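As far as I can tell from ?sort, the two methods disagree on this data because "shell" collates character vectors according to the current locale, whereas "radix" does not respect the locale and effectively compares character codes. A minimal sketch of that comparison on the values from the question (x is only an illustrative name):

```r
x <- c("3_28", "31_60", "48_68")

# "3_28" and "31_60" first differ at their second character.
utf8ToInt("_")  # 95
utf8ToInt("1")  # 49

# Compared by code point, "1" comes before "_", so "31_60" sorts first.
sort(x, method = "radix")
# [1] "31_60" "3_28"  "48_68"
```

Locale-aware collation, by contrast, can place "3_28" first, which is what the data.frame output in the question shows.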

However, if we force base::order to use method = "radix", we get the same result as data.table.

order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3

A[order(A$one, A$two, method = "radix"),]
#   one   two
# 2   k 31_60
# 1   k  3_28
# 3   k 48_68
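For auditing existing code, one possible approach is to check which character columns actually sort differently under the two methods in the current locale. The sketch below assumes a plain data.frame as input; order_differs and the example columns are hypothetical names introduced for illustration, not part of base R or data.table.

```r
# Hypothetical helper: flags character columns whose default order()
# differs from the radix order in the current locale.
# Assumes `df` is a plain data.frame (not a data.table).
order_differs <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  vapply(
    df[is_chr],
    function(col) !identical(order(col), order(col, method = "radix")),
    logical(1)
  )
}

df <- data.frame(
  id    = c("b", "a", "c"),
  range = c("3_28", "31_60", "48_68"),
  n     = c(2, 1, 3),
  stringsAsFactors = FALSE
)

# Named logical vector with one entry per character column.
# The value for `range` depends on the locale the session runs in.
order_differs(df)
```

Columns flagged TRUE are the ones where data.frame and data.table code could produce different row orders.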

Conversely, you can call base::order explicitly inside the data.table to get the same ordering as the data.frame:
B[base::order(B$one,B$two),]
#       one    two
#    <char> <char>
# 1:      k   3_28
# 2:      k  31_60
# 3:      k  48_68
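Another option, sketched here on the assumption that changing the collation locale for the session is acceptable, is to set LC_COLLATE to "C". In the C locale base::order's default method compares strings by their byte values, which should match the radix ordering on this data. The variable names x and old are illustrative.

```r
x <- c("3_28", "31_60", "48_68")

old <- Sys.getlocale("LC_COLLATE")   # remember the current setting
Sys.setlocale("LC_COLLATE", "C")     # collate by byte value

order(x)
# [1] 2 1 3
identical(order(x), order(x, method = "radix"))
# [1] TRUE

Sys.setlocale("LC_COLLATE", old)     # restore the original setting
```

Note that this affects every locale-dependent comparison and sort in the session, so passing method = "radix" explicitly is the less invasive choice when only a few calls are involved.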