What is the difference between <uniq> in BASH and <unique> in R?

Question

I get different results from uniq in BASH and unique in R. My df looks like this (more than 9000 rows):

samples read_seq
ccd_x29 GCATTGGT
ccd_x29 GCATTGGT
ccd_x29 GCATTGGT
ccd_x20 GCCCGGCTAG
ccd_x19 GCATTGGTGGTT
ccd_x19 GCATTGGTGGTT

After running uniq in bash I get 8811 lines; after df <- unique(df) in R I get 8803 rows.

What causes this difference?

Answer 1

From the R docs for unique:

Note that unlike the Unix command uniq this omits duplicated and not just repeated elements/rows. That is, an element is omitted if it is equal to any previous element and not just if it is equal the immediately previous one. (For the latter, see rle).
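
The distinction quoted above can be reproduced on the shell side with a small sketch. The five-row sample below is hypothetical (it is not the asker's data); it places a duplicate row on non-adjacent lines so that uniq alone cannot remove it. Sorting first, or using the common awk idiom shown, should give the same count as R's unique.

```shell
# Hypothetical sample: "a 1" and "b 2" each recur, but not always on adjacent lines.
data=$(printf '%s\n' 'a 1' 'a 1' 'b 2' 'a 1' 'b 2')

# uniq only collapses runs of identical adjacent lines.
adjacent=$(printf '%s\n' "$data" | uniq | wc -l | tr -d ' ')

# Sorting first makes every duplicate adjacent, so the count matches unique() in R.
global=$(printf '%s\n' "$data" | sort | uniq | wc -l | tr -d ' ')

# awk keeps the first occurrence of each line and preserves the input order.
ordered=$(printf '%s\n' "$data" | awk '!seen[$0]++')

echo "uniq: $adjacent, sort | uniq: $global"
# prints "uniq: 4, sort | uniq: 2"
```

Note that sort | uniq changes the row order, whereas unique in R keeps rows in order of first appearance; the awk variant is the closer match if order matters.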

Answer 2

If we only want to drop rows that repeat the immediately preceding row (the behaviour of uniq), one option is rleid from data.table:

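library(data.table)  # provides rleid()
library(dplyr)
df %>%
    # give each run of consecutive identical rows its own id
    mutate(new = rleid(samples, read_seq)) %>%
    # keep only the first row of each run
    distinct(new, .keep_all = TRUE) %>%
    select(-new)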
