Sequence number for duplicate rows in R
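I have a data frame with numeric and character columns in which some rows are duplicated. To tell these rows apart, I want to add a new column (called "duplicateID" in my example) that numbers the rows within each "block" of duplicate rows from 1 to n.
My data set looks like this: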
a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
df1 <- data.frame(a, b)
> df1
a b
1 one 3.5
2 one 3.5
3 one 3.5
4 one 2.5
5 two 3.5
6 two 3.5
7 three 1.0
8 four 2.2
9 four 7.0
10 four 7.0
The desired output is:
a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
duplicateID = c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
df2 <- data.frame(a, b, duplicateID)
> df2
a b duplicateID
1 one 3.5 1
2 one 3.5 2
3 one 3.5 3
4 one 2.5 1
5 two 3.5 1
6 two 3.5 2
7 three 1.0 1
8 four 2.2 1
9 four 7.0 1
10 four 7.0 2
Thanks in advance, everyone!
One way to achieve this, using dplyr:
library(dplyr)
df1 %>%
  # build grouping by combination of variables
  dplyr::group_by(a, b) %>%
  # add row number which works per group due to prior grouping
  dplyr::mutate(duplicateID = dplyr::row_number()) %>%
  # ungroup to prevent unexpected behaviour downstream
  dplyr::ungroup()
# A tibble: 10 x 3
a b duplicateID
<chr> <dbl> <int>
1 one 3.5 1
2 one 3.5 2
3 one 3.5 3
4 one 2.5 1
5 two 3.5 1
6 two 3.5 2
7 three 1 1
8 four 2.2 1
9 four 7 1
10 four 7 2
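If you would rather not list the grouping columns by hand, the same idea can be wrapped in a small function. This is only a sketch: `add_duplicate_id` is a name I made up (it is not part of dplyr), and it assumes that "duplicate" means identical in every column and that dplyr 1.0.0 or later is installed, since it relies on `across()`.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four"),
  b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
)

# Hypothetical helper (not a dplyr function): number the rows that are
# identical across ALL columns of the data frame
add_duplicate_id <- function(df) {
  df %>%
    # group by every column, so each group is one block of identical rows
    group_by(across(everything())) %>%
    # row_number() counts 1..n inside each group
    mutate(duplicateID = row_number()) %>%
    # drop the grouping so later steps are not affected by it
    ungroup()
}

res <- add_duplicate_id(df1)
print(res)
```

The row order of the input is preserved, so the new column lines up with the desired output in the question.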
It is probably not as fast as dplyr (and data.table of course has options too), but in base R you can do this with the "ave" function combined with "seq_along":
a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four")
b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
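df1 <- data.frame(a, b)
# ave() applies FUN within each (a, b) group and keeps the original row order;
# seq_along() numbers the rows of each group 1..n
df1$dupID <- with(df1, ave(seq_along(a), a, b, FUN = seq_along))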
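To make the grouping step visible, here is a small base R sketch that builds the group key explicitly and then checks the result against the desired output from the question. The variable names `key` and `expected` are mine, chosen only for illustration.

```r
# Example data from the question
df1 <- data.frame(
  a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four"),
  b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
)

# One factor level per (a, b) combination that actually occurs
key <- interaction(df1$a, df1$b, drop = TRUE)

# Within each group, seq_along() yields 1..n; ave() writes those values
# back into the positions of the original rows
df1$dupID <- ave(seq_along(key), key, FUN = seq_along)

# Desired numbering, taken from the question
expected <- c(1, 2, 3, 1, 1, 2, 1, 1, 1, 2)
print(all(df1$dupID == expected))
```

No packages are needed, which can be convenient when dplyr or data.table are not available.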
We can use rowid from data.table:
library(data.table)
setDT(df1)[, dupID := rowid(a, b)]
Output:
> df1
a b dupID
1: one 3.5 1
2: one 3.5 2
3: one 3.5 3
4: one 2.5 1
5: two 3.5 1
6: two 3.5 2
7: three 1.0 1
8: four 2.2 1
9: four 7.0 1
10: four 7.0 2
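For comparison, the same numbering can be written with an explicit `by` and `.N`, which may be easier to read if you are not familiar with rowid. This is a sketch assuming data.table is installed; the column name `dupID2` is used only to compare the two approaches side by side.

```r
library(data.table)

# Example data from the question, built directly as a data.table
dt <- data.table(
  a = c("one", "one", "one", "one", "two", "two", "three", "four", "four", "four"),
  b = c(3.5, 3.5, 3.5, 2.5, 3.5, 3.5, 1, 2.2, 7, 7)
)

# .N is the number of rows in the current group, so seq_len(.N) counts 1..n
dt[, dupID := seq_len(.N), by = .(a, b)]

# rowid() produces the same numbering without an explicit `by`
dt[, dupID2 := rowid(a, b)]

print(dt)
```

Both columns are added by reference with `:=`, so the original row order is kept and no copy of the table is made.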