Replace whole word or words with partial match in R
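I have a data frame containing thousands of misspelled city names. I need to correct them and have not been able to find a solution, although I have searched extensively. I have tried several functions and methods.
Here is a tiny sample of the data: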
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"))
num city
1 1 BORNE
2 2 BOERNAE
3 3 BARNE
4 4 BOERNE
5 5 GALDEN
6 6 GELDON
7 7 GOELDEN
8 8 GOLDEN
These are some of the functions I tried. I also tried several more, including str_replace and str_detect:
cit <- function(x){
ifelse(x %in% grepl(c("BOR","BOE","BAR")),"BOERNE",
ifelse(x %in% grepl(c("GAL","GEL","GOE")), "GOLDEN", "OTHER"))
}
or
cit <- function(x){
ifelse(x %in% c("BOR","BOE","BAR"),"BOERNE",
ifelse(x %in% c("GAL","GEL","GOE"), "GOLDEN", "OTHER"))
}
Running the code:
`citA$city2 <- cit(citA$city)`
Incorrect result:
num city city2
1 1 BORNE OTHER
2 2 BOERNAE OTHER
3 3 BARNE OTHER
4 4 BOERNE OTHER
5 5 GALDEN OTHER
6 6 GELDON OTHER
7 7 GOELDEN OTHER
8 8 GOLDEN OTHER
I also tried:
citA$city[grepl(c("BOR","BOE","BAR"),citA$city)] <- "BOERNE"
But this produces a warning, and only the first pattern is used:
Warning message:
In grepl(c("BOR", "BOE", "BAR"), citA$city) :
argument 'pattern' has length > 1 and only the first element will be used
Your ideas would be very helpful!
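We can paste the patterns into a single string separated by | (meaning OR) and pass that as the pattern to grep. The pattern argument of grep is not vectorized, i.e. it takes only a single element: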
citA$city[grepl(paste(c("BOR","BOE","BAR"), collapse="|"),citA$city)] <- "BOERNE"
citA
# num city
#1 1 BOERNE
#2 2 BOERNE
#3 3 BOERNE
#4 4 BOERNE
#5 5 GALDEN
#6 6 GELDON
#7 7 GOELDEN
#8 8 GOLDEN
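The same fix can be applied to the cit() function from the question. The original versions returned OTHER for every row because %in% tests whole-string equality, so "BORNE" %in% c("BOR","BOE","BAR") is FALSE. Below is a minimal sketch of a corrected version, assuming the substring lists from the question plus "GOL" (added here so that the correctly spelled GOLDEN is also matched) and a made-up extra city "AUSTIN" to show the fallback:

```r
# Corrected sketch of the asker's cit(): collapse each group of substrings
# into one regular expression and test it with grepl() instead of %in%.
cit <- function(x) {
  boerne <- paste(c("BOR", "BOE", "BAR"), collapse = "|")
  golden <- paste(c("GAL", "GEL", "GOE", "GOL"), collapse = "|")
  ifelse(grepl(boerne, x), "BOERNE",
         ifelse(grepl(golden, x), "GOLDEN", "OTHER"))
}

# "AUSTIN" is a hypothetical extra value, not part of the original data
cities <- c("BORNE", "BOERNAE", "BARNE", "BOERNE",
            "GALDEN", "GELDON", "GOELDEN", "GOLDEN", "AUSTIN")
result <- cit(cities)
print(result)
```

Because grepl() is vectorized over x (but not over pattern), this works directly on a whole column, e.g. citA$city2 <- cit(citA$city).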
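Note: the 'city' column was created as a factor. It should be of class character, which you get by using stringsAsFactors = FALSE (since R 4.0.0 this is the default for data.frame()).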
data
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"),
stringsAsFactors = FALSE)
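To illustrate why the column class matters, here is a small sketch. The citB data frame is a hypothetical three-row example, forced to a factor to mimic the pre-4.0.0 default. In the sample data above, "BOERNE" happens to be an existing level, so the assignment works even on a factor. A replacement value that is not already a level would become NA.

```r
# Hypothetical example; stringsAsFactors = TRUE mimics the R < 4.0.0 default
citB <- data.frame(num = 1:3,
                   city = c("BORNE", "BARNE", "GALDEN"),
                   stringsAsFactors = TRUE)
print(class(citB$city))

# "BOERNE" is not one of the levels here, so the assignment yields NA
# (R also emits an "invalid factor level" warning, suppressed here)
bad <- citB$city
suppressWarnings(bad[1] <- "BOERNE")
print(is.na(bad[1]))

# Converting to character first avoids the problem
citB$city <- as.character(citB$city)
citB$city[grepl("BOR|BAR", citB$city)] <- "BOERNE"
print(citB$city)
```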
If you have many such patterns, you can use case_when from dplyr:
library(dplyr)
library(stringr)
citA %>%
mutate(city2 = case_when(str_detect(city, 'BOR|BOE|BAR') ~ 'BOERNE',
str_detect(city, 'GAL|GEL|GOE|GOL') ~ 'GOLDEN',
TRUE ~ 'OTHER'))
# num city city2
#1 1 BORNE BOERNE
#2 2 BOERNAE BOERNE
#3 3 BARNE BOERNE
#4 4 BOERNE BOERNE
#5 5 GALDEN GOLDEN
#6 6 GELDON GOLDEN
#7 7 GOELDEN GOLDEN
#8 8 GOLDEN GOLDEN
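If dplyr is not an option, roughly the same result can be sketched in base R by looping over a lookup list. The fixes list and the ^ anchor (which restricts matches to the start of the name) are assumptions made for this illustration, not part of the answer above. With thousands of names, the list could be built from a file instead of typed by hand.

```r
# Hypothetical lookup: names are the corrected spellings,
# values are the prefixes that identify each misspelling group
fixes <- list(BOERNE = c("BOR", "BOE", "BAR"),
              GOLDEN = c("GAL", "GEL", "GOE", "GOL"))

city <- c("BORNE", "BOERNAE", "BARNE", "BOERNE",
          "GALDEN", "GELDON", "GOELDEN", "GOLDEN")

# Start with the fallback value, then overwrite the rows matching each group
city2 <- rep("OTHER", length(city))
for (target in names(fixes)) {
  # Build e.g. "^(BOR|BOE|BAR)"; "^" anchors the match to the start
  pattern <- paste0("^(", paste(fixes[[target]], collapse = "|"), ")")
  city2[grepl(pattern, city)] <- target
}
print(city2)
```

Unlike case_when, where the first matching condition wins, later groups here overwrite earlier ones if a name matches more than one pattern, so the groups should not overlap.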
I have a package on GitHub that may help. It allows recoding factor levels using regular-expression matching. Install the package using
devtools::install_github("jwilliman/xfactor")
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"))
citA$city2 <- xfactor::xfactor(citA$city, levels = c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))
citA
#> num city city2
#> 1 1 BORNE BOERNE
#> 2 2 BOERNAE BOERNE
#> 3 3 BARNE BOERNE
#> 4 4 BOERNE BOERNE
#> 5 5 GALDEN GOLDEN
#> 6 6 GELDON GOLDEN
#> 7 7 GOELDEN GOLDEN
#> 8 8 GOLDEN GOLDEN
Created on 2020-04-20 by the reprex package (v0.3.0)
Otherwise, you can use the following function to clean/update the factor levels, using a similar syntax.
# make_levels() works on factor levels, so force a factor on R >= 4.0.0
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"),
stringsAsFactors = TRUE)
make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {
  lvls <- levels(.f)
  # Replacements can be listed in the replacement argument, taken as names in patterns, or the patterns themselves.
  if(is.null(replacement)) {
    if(is.null(names(patterns)))
      replacement <- patterns
    else
      replacement <- names(patterns)
  }
  # Find matching levels
  lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
  for(i in seq_along(patterns))
    lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)
  # Append other non-matching levels
  lvl_other <- setdiff(lvls, unlist(lvl_match))
  lvl_all <- append(
    lvl_match,
    setNames(as.list(lvl_other), lvl_other)
  )
  return(lvl_all)
}
levels(citA$city) <- make_levels(citA$city, c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))
citA
#> num city
#> 1 1 BOERNE
#> 2 2 BOERNE
#> 3 3 BOERNE
#> 4 4 BOERNE
#> 5 5 GOLDEN
#> 6 6 GOLDEN
#> 7 7 GOLDEN
#> 8 8 GOLDEN
Created on 2020-04-20 by the reprex package (v0.3.0)
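The function above relies on a base R feature: assigning a named list to levels() merges all the old levels listed under each name into one new level. A minimal sketch of that mechanism on its own, using a hypothetical four-element factor f:

```r
# Hypothetical small factor with two misspellings and two distinct cities
f <- factor(c("BORNE", "BARNE", "GALDEN", "GOLDEN"))

# Each list name becomes a new level; the values are the old levels to merge.
# Old levels not mentioned in the list would become NA, which is why
# make_levels() appends the non-matching levels as well.
levels(f) <- list(BOERNE = c("BORNE", "BARNE"),
                  GOLDEN = c("GALDEN", "GOLDEN"))

print(levels(f))
print(as.character(f))
```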