"Marking" R 中的重复项
"Marking" Duplicates in R
I am working with the R programming language. Suppose I have the following data:
Data_I_Have <- data.frame(
"Person" = c("John", "John", "John", "Peter", "Peter", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
"Number_of_Kids" = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
)
Person Number_of_Kids
1 John 4
2 John 1
3 John 1
4 Peter 5
5 Peter 2
6 Peter 3
7 Tim 7
8 Kevin 0
9 Adam 3
10 Adam 3
11 Xavier 5
Is it possible to "mark" each duplicated name so that the result looks like the data below (e.g. John_1, John_2, etc.)?
Data_I_Want <- data.frame(
"Person" = c("John_1", "John_2", "John_3", "Peter_1", "Peter_2", "Peter_3", "Tim", "Kevin", "Adam_1", "Adam_2", "Xavier"),
"Number_of_Kids" = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
)
Person Number_of_Kids
1 John_1 4
2 John_2 1
3 John_3 1
4 Peter_1 5
5 Peter_2 2
6 Peter_3 3
7 Tim 7
8 Kevin 0
9 Adam_1 3
10 Adam_2 3
11 Xavier 5
Following a previous question, I tried the approach used there:
Data_I_Want <- make.unique(Data_I_Have, sep = '_')
But this gave me the following error:
Error in make.unique(Data_I_Have, sep = "_") :
'names' must be a character vector
Can someone tell me how to fix this?
Thanks!
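make.unique expects a vector, not a data.frame. By default it appends 1, 2, 3 with . (the sep) only to the repeated values, starting at the second occurrence rather than the first, i.e.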
> make.unique(Data_I_Have$Person)
[1] "John" "John.1" "John.2" "Peter" "Peter.1" "Peter.2" "Tim" "Kevin" "Adam" "Adam.1" "Xavier"
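For comparison, here is a minimal base R sketch of the same idea without extra packages. It first shows that make.unique() does accept sep = "_" once it is given the column rather than the data.frame, and then uses ave() to number the rows within each name so that the first occurrence is suffixed too. The variable names (with_sep, idx, n, tagged) are only illustrative and do not come from the answers.

```r
# Names taken from the question's Person column
Person <- c("John", "John", "John", "Peter", "Peter", "Peter",
            "Tim", "Kevin", "Adam", "Adam", "Xavier")

# make.unique() works on a character vector; the suffix starts at the
# second occurrence, so the first "John" is left unchanged
with_sep <- make.unique(Person, sep = "_")

# Position of each row within its name group, and the size of that group
idx <- ave(seq_along(Person), Person, FUN = seq_along)
n   <- ave(seq_along(Person), Person, FUN = length)

# Suffix only the names that occur more than once
tagged <- ifelse(n > 1, paste(Person, idx, sep = "_"), Person)
tagged
```

Assigning tagged back with Data_I_Have$Person <- tagged should then give the layout asked for in the question.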
To get the desired output, group by 'Person', paste row_number() onto the group column for groups with more than one row, and then ungroup().
library(dplyr)
library(stringr)
Data_I_Have %>%
group_by(Person) %>%
mutate(Person = case_when(n() > 1 ~
str_c(Person, "_", row_number()), TRUE ~ Person)) %>%
ungroup()
-output
# A tibble: 11 x 2
Person Number_of_Kids
<chr> <chr>
1 John_1 4
2 John_2 1
3 John_3 1
4 Peter_1 5
5 Peter_2 2
6 Peter_3 3
7 Tim 7
8 Kevin 0
9 Adam_1 3
10 Adam_2 3
11 Xavier 5
Here is another approach (updated, corrected version):
library(dplyr)
Data_I_Have %>%
group_by(Person) %>%
mutate(id = row_number(),
n = n(),
Person = ifelse(n >1, paste(Person, id, sep="_"), Person)) %>%
select(-id, -n)
Person Number_of_Kids
<chr> <chr>
1 John_1 4
2 John_2 1
3 John_3 1
4 Peter_1 5
5 Peter_2 2
6 Peter_3 3
7 Tim 7
8 Kevin 0
9 Adam_1 3
10 Adam_2 3
11 Xavier 5
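A tidyverse option using cumsum(). Note that it suffixes every name, including those that appear only once (e.g. Tim_1).
library(dplyr)
library(stringr)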
Data_I_Have %>%
group_by(Person) %>%
mutate(cnt = 1,
Person = str_c(Person, cumsum(cnt), sep = '_')) %>%
ungroup() %>%
select(-cnt)
# # A tibble: 11 x 2
# Person Number_of_Kids
# <chr> <chr>
# 1 John_1 4
# 2 John_2 1
# 3 John_3 1
# 4 Peter_1 5
# 5 Peter_2 2
# 6 Peter_3 3
# 7 Tim_1 7
# 8 Kevin_1 0
# 9 Adam_1 3
# 10 Adam_2 3
# 11 Xavier_1 5
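The cumsum() output above also tags names that occur only once (Tim_1, Kevin_1, Xavier_1), which differs from the layout requested in the question. One possible adjustment, sketched here as an assumption rather than as part of the original answer, is to apply the suffix only when the group has more than one row. It assumes a dplyr version that, like the answers in this thread, allows the grouping column to be modified inside mutate(); the name result is illustrative.

```r
library(dplyr)

Data_I_Have <- data.frame(
  Person = c("John", "John", "John", "Peter", "Peter", "Peter",
             "Tim", "Kevin", "Adam", "Adam", "Xavier"),
  Number_of_Kids = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
)

result <- Data_I_Have %>%
  group_by(Person) %>%
  mutate(cnt = 1,
         # n() is the group size; names that occur once keep the bare name
         Person = if (n() > 1) paste(Person, cumsum(cnt), sep = "_") else Person) %>%
  ungroup() %>%
  select(-cnt)

result$Person
```

Because n() is a single value per group, a plain if/else is enough here and no vectorised ifelse() is needed.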
Using data.table, where .N is the symbol for the number of rows in the group.
Data_I_Have <- data.frame(
"Person" = c("John", "John", "John", "Peter", "Peter", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
"Number_of_Kids" = c("4", "1", "1", "5", "2", "3", "7", "0", "3", "3", "5")
)
library(data.table)
setDT(Data_I_Have)
Data_I_Have[, Person := if (.N == 1) Person else paste0(Person, "_", seq(.N)),
by = Person]
Data_I_Have
#> Person Number_of_Kids
#> 1: John_1 4
#> 2: John_2 1
#> 3: John_3 1
#> 4: Peter_1 5
#> 5: Peter_2 2
#> 6: Peter_3 3
#> 7: Tim 7
#> 8: Kevin 0
#> 9: Adam_1 3
#> 10: Adam_2 3
#> 11: Xavier 5
Created on 2021-09-15 by the reprex package (v2.0.1)