R: Find and replace character in one df col with character from another df
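I have a df (data frame A) with more than 200,000 rows, in which incorrect IDs are scattered throughout a single column (id). I have another df (data frame B) in which every incorrect ID from data frame A is matched with its corrected ID. How can I use data frame B to fix the errors in data frame A's id column?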
Old data frame A column:
id
C-2005-8-11-14
C-2005-8-11-15
C-2005-8-11-16
C-2006-3-7-1
C-2007-1-10-8
C-2007-1-10-9
C-2007-1-10-10
C-2008-7-2-4
C-2009-1-15-41
Data frame B:
bad_id          correct_id
C-2005-8-11-14  C-2005-8-22-14
C-2006-3-7-1    C-2006-3-30-1
C-2009-1-15-41  C-2009-1-12-41
New data frame A column:
id
C-2005-8-22-14
C-2005-8-11-15
C-2005-8-11-16
C-2006-3-30-1
C-2007-1-10-8
C-2007-1-10-9
C-2007-1-10-10
C-2008-7-2-4
C-2009-1-12-41
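A possible solution:
library(dplyr)
df1 %>%
  # left_join keeps exactly the rows of df1 and attaches a correction where one exists
  left_join(df2, by = c("id" = "bad_id")) %>%
  # correct_id goes first so the correction wins; rows without one keep their id
  mutate(id = coalesce(correct_id, id), correct_id = NULL)
#>               id
#> 1 C-2005-8-22-14
#> 2 C-2005-8-11-15
#> 3 C-2005-8-11-16
#> 4  C-2006-3-30-1
#> 5  C-2007-1-10-8
#> 6  C-2007-1-10-9
#> 7 C-2007-1-10-10
#> 8   C-2008-7-2-4
#> 9 C-2009-1-12-41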
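For anyone who wants to run the join approach end to end, here is a minimal sketch on the question's sample data. The data frames are rebuilt by hand, the columns are assumed to be character (not factor), and the result name `fixed` is only illustrative. It uses `left_join` so the result has exactly the rows of `df1`, and puts `correct_id` first in `coalesce()` so a correction takes priority over the original id.

```r
library(dplyr)

# Sample data copied from the question; character columns are an assumption
df1 <- data.frame(
  id = c("C-2005-8-11-14", "C-2005-8-11-15", "C-2005-8-11-16",
         "C-2006-3-7-1", "C-2007-1-10-8", "C-2007-1-10-9",
         "C-2007-1-10-10", "C-2008-7-2-4", "C-2009-1-15-41"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  bad_id     = c("C-2005-8-11-14", "C-2006-3-7-1", "C-2009-1-15-41"),
  correct_id = c("C-2005-8-22-14", "C-2006-3-30-1", "C-2009-1-12-41"),
  stringsAsFactors = FALSE
)

fixed <- df1 %>%
  # Attach the correction (or NA) to every row of df1
  left_join(df2, by = c("id" = "bad_id")) %>%
  # Take the correction when present, otherwise keep the original id
  mutate(id = coalesce(correct_id, id)) %>%
  # Drop the helper column so only id remains
  select(-correct_id)
```

If df2 could list the same bad_id more than once, the join would duplicate that row of df1, so it may be worth checking `anyDuplicated(df2$bad_id) == 0` before joining.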
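We can do this with an ifelse statement, assuming df1 = A and df2 = B. The replacement has to be looked up per row with match(); B$correct_id on its own would be recycled by row position rather than matched by id.
library(dplyr)
A %>%
  # match() gives the row of B whose bad_id equals each id (NA when not listed)
  mutate(id = ifelse(id %in% B$bad_id, B$correct_id[match(id, B$bad_id)], id))
              id
1 C-2005-8-22-14
2 C-2005-8-11-15
3 C-2005-8-11-16
4  C-2006-3-30-1
5  C-2007-1-10-8
6  C-2007-1-10-9
7 C-2007-1-10-10
8   C-2008-7-2-4
9 C-2009-1-12-41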
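The same lookup can also be written in base R without dplyr, which may be convenient on a 200,000-row data frame. This is a minimal sketch, assuming id, bad_id and correct_id are character vectors (not factors) and that each bad_id appears only once in B. The data frames are rebuilt by hand from the question's sample, and `idx` and `hit` are illustrative names.

```r
# Sample data copied from the question; character columns are an assumption
A <- data.frame(
  id = c("C-2005-8-11-14", "C-2005-8-11-15", "C-2005-8-11-16",
         "C-2006-3-7-1", "C-2007-1-10-8", "C-2007-1-10-9",
         "C-2007-1-10-10", "C-2008-7-2-4", "C-2009-1-15-41"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  bad_id     = c("C-2005-8-11-14", "C-2006-3-7-1", "C-2009-1-15-41"),
  correct_id = c("C-2005-8-22-14", "C-2006-3-30-1", "C-2009-1-12-41"),
  stringsAsFactors = FALSE
)

# Position of each A$id within B$bad_id; NA where the id is not listed in B
idx <- match(A$id, B$bad_id)

# Overwrite only the rows that have a match, leaving every other id untouched
hit <- !is.na(idx)
A$id[hit] <- B$correct_id[idx[hit]]
```

Because only the matched rows are assigned, ids that do not appear in B are never touched, and the row order of A is preserved.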