Overriding data.table key order causes incorrect merge results

Question

In the example below, I use dplyr::arrange on a data.table that has a key. This overrides the sort order of the key column:

library(data.table)
library(dplyr)  # provides arrange() and %>%

x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setkey(x, "a")

# lose order on datatable key
x <- dplyr::arrange(x, b)

y <- data.table(a = sample(1000:1100), f = c(letters, NA), g = c("AA", "BB", NA, NA, NA, NA))
setkey(y, "a")

res <- merge(x, y, by = c("a"), all.x = TRUE)
# try merge with key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)

# merge results are inconsistent
identical(res, res2)

I can see that if I sort with x <- x[order(b)] instead, I keep the sort on the key and the results are consistent.

I'm not sure why I can't use dplyr::arrange, or what the key ordering has to do with the merge. Any insight would be appreciated.

Answer 1

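The problem is that dplyr::arrange(x, b) does not remove the sorted attribute from the data.table, whereas x <- x[order(b)] and setorder(x, "b") do. The table therefore still claims to be keyed (sorted) on a even though its rows are now ordered by b, and a keyed join can then rely on an ordering that no longer holds.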
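A quick way to see the difference is to inspect key(), which reads the sorted attribute, after each way of reordering. The sketch below uses a small made-up table (not the data from the question) and runs only data.table calls. The dplyr::arrange line is left as a comment because its result depends on the installed dplyr version.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the question's)
x <- data.table(a = 1:6, b = c("C", "A", "B", "C", "A", "B"))
setkey(x, a)
key_before <- key(x)            # the key is stored in attr(x, "sorted")

# Reordering with `[` returns a new table whose key has been dropped
x_order <- x[order(b)]
key_order <- key(x_order)

# setorder() reorders by reference and also drops the key
x_setorder <- copy(x)
setorder(x_setorder, b)
key_setorder <- key(x_setorder)

# With dplyr installed, compare against:
# key(dplyr::arrange(x, b))     # per the explanation above, the key may survive

# If a stale key is suspected, it can be removed explicitly
x_clean <- copy(x)
setkey(x_clean, NULL)
key_clean <- key(x_clean)
```

If key() still reports "a" on a table whose rows are no longer sorted by a, removing the key with setkey(x, NULL) before merging is one way to avoid relying on it.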

The data.table way is to use setorder in the first place, e.g.

library(data.table)
x <- data.table(a = sample(1000:1100), b = sample(c("A", NA, "B", "C", "D"), replace = TRUE), c = letters)
setorder(x, "b", "a", na.last=TRUE)
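To check that this route gives a consistent merge, the result can be compared with a merge on a copy that never carried a key, much as the question does. This is only a sketch: the tables are small made-up data with equal-length columns, and x_plain and same are names introduced here for illustration.

```r
library(data.table)

# Hypothetical data, chosen so that every column has the same length
x <- data.table(a = 6:1, b = c("B", NA, "A", "C", "A", "B"))
setkey(x, a)
setorder(x, b, a, na.last = TRUE)  # reorders by reference and drops the key on a

y <- data.table(a = c(2L, 4L, 6L, 8L), f = c("w", "x", "y", "z"))
setkey(y, a)

# Reference: the same rows in a table that never had a key
x_plain <- as.data.table(as.data.frame(x))

res  <- merge(x, y, by = "a", all.x = TRUE)
res2 <- merge(x_plain, y, by = "a", all.x = TRUE)

# Compare contents only, ignoring attributes
same <- isTRUE(all.equal(res, res2, check.attributes = FALSE))
```

Here same should be TRUE, because setorder has already removed the key, so neither merge sees a stale sorted attribute.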

Incorrect results from joins on a keyed data.table are a known bug (see also #5361 in the bug tracker).

r

dplyr

data.table