Replace whole word or words with partial match in R

Question

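I have a data frame containing thousands of misspelled city names. I need to correct these and have not been able to find a solution, although I have searched extensively. I have tried several functions and methods.

Here is a tiny sample of the data: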

citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
               "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                          "GALDEN","GELDON","GOELDEN","GOLDEN"))

   num    city
1   1   BORNE
2   2 BOERNAE
3   3   BARNE
4   4  BOERNE
5   5  GALDEN
6   6  GELDON
7   7 GOELDEN
8   8  GOLDEN

These are some of the functions I have tried; I also tried many more, including str_replace and str_detect:

cit <- function(x){
  ifelse(x %in% grepl(c("BOR","BOE","BAR")),"BOERNE",
         ifelse(x %in% grepl(c("GAL","GEL","GOE")), "GOLDEN", "OTHER"))
}

or

cit <- function(x){
  ifelse(x %in% c("BOR","BOE","BAR"),"BOERNE",
         ifelse(x %in% c("GAL","GEL","GOE"), "GOLDEN", "OTHER"))
}

Running the code:

`citA$city2 <- cit(citA$city)`

Incorrect result:

  num    city city2
1   1   BORNE OTHER
2   2 BOERNAE OTHER
3   3   BARNE OTHER
4   4  BOERNE OTHER
5   5  GALDEN OTHER
6   6  GELDON OTHER
7   7 GOELDEN OTHER
8   8  GOLDEN OTHER

I also tried:

citA$city[grepl(c("BOR","BOE","BAR"),citA$city)] <- "BOERNE"

But this produces a warning, and only the first pattern is used:

Warning message:
In grepl(c("BOR", "BOE", "BAR"), citA$city) :
  argument 'pattern' has length > 1 and only the first element will be used

Any ideas would be very helpful!

Answer 1

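We can paste the patterns into a single string separated by | (meaning OR) and pass that to grepl. The pattern argument of grep/grepl is not vectorized, i.e. it uses only one element.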

citA$city[grepl(paste(c("BOR","BOE","BAR"), collapse="|"),citA$city)] <- "BOERNE" 
citA
#  num    city
#1   1  BOERNE
#2   2  BOERNE
#3   3  BOERNE
#4   4  BOERNE
#5   5  GALDEN
#6   6  GELDON
#7   7 GOELDEN
#8   8  GOLDEN

Note: the 'city' column was created as a factor (the default in R versions before 4.0). It should be of class character, which you get by using stringsAsFactors = FALSE.

Data

citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
           "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                      "GALDEN","GELDON","GOELDEN","GOLDEN"),
        stringsAsFactors = FALSE)
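The same idea can be extended to cover both cities in one pass. The following is only a minimal base R sketch: the `fixes` lookup vector and the `city2` column with an "OTHER" default are illustrative choices taken from the question, not part of the answer above.

```r
# Sample data from the question, with city stored as character
citA <- data.frame(num = 1:8,
                   city = c("BORNE", "BOERNAE", "BARNE", "BOERNE",
                            "GALDEN", "GELDON", "GOELDEN", "GOLDEN"),
                   stringsAsFactors = FALSE)

# Hypothetical lookup: names are the corrected spellings,
# values are the alternation patterns that identify each city
fixes <- c(BOERNE = "BOR|BOE|BAR",
           GOLDEN = "GAL|GEL|GOE|GOL")

# Start from a default label, then overwrite wherever a pattern matches
citA$city2 <- "OTHER"
for (correct in names(fixes)) {
  citA$city2[grepl(fixes[[correct]], citA$city)] <- correct
}

citA
```

If two patterns could match the same value, the later entry in `fixes` wins here, so the order of the lookup matters.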

Answer 2

If you have many such patterns, you can use case_when from dplyr:

library(dplyr)
library(stringr)

citA %>%
  mutate(city2 = case_when(str_detect(city, 'BOR|BOE|BAR') ~ 'BOERNE', 
                           str_detect(city, 'GAL|GEL|GOE|GOL') ~ 'GOLDEN',
                           TRUE ~ 'OTHER'))

#  num    city  city2
#1   1   BORNE BOERNE
#2   2 BOERNAE BOERNE
#3   3   BARNE BOERNE
#4   4  BOERNE BOERNE
#5   5  GALDEN GOLDEN
#6   6  GELDON GOLDEN
#7   7 GOELDEN GOLDEN
#8   8  GOLDEN GOLDEN
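One thing to keep in mind with patterns like these is that they match anywhere in the string, not only at the start. The sketch below uses base R's grepl and a made-up value ("EMBARGO", not from the question) to show the difference that a `^` anchor makes; the same anchored pattern string could be passed to str_detect.

```r
# Hypothetical inputs: the second value contains "BAR" only in the middle
city <- c("BARNE", "EMBARGO", "GOELDEN")

# Unanchored: matches "BAR" anywhere in the string
unanchored <- grepl("BOR|BOE|BAR", city)
unanchored
# [1]  TRUE  TRUE FALSE

# Anchored: the alternatives must appear at the start of the string
anchored <- grepl("^(BOR|BOE|BAR)", city)
anchored
# [1]  TRUE FALSE FALSE
```

Whether anchoring is appropriate depends on how the misspellings in the real data look, so treat it as an option rather than a requirement.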

Answer 3

I have a package on GitHub that may help; it allows factor levels to be recoded using regular expression matching. Install the package with

devtools::install_github("jwilliman/xfactor")


citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
                   "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                              "GALDEN","GELDON","GOELDEN","GOLDEN"))

citA$city2 <- xfactor::xfactor(citA$city, levels = c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))

citA
#>   num    city  city2
#> 1   1   BORNE BOERNE
#> 2   2 BOERNAE BOERNE
#> 3   3   BARNE BOERNE
#> 4   4  BOERNE BOERNE
#> 5   5  GALDEN GOLDEN
#> 6   6  GELDON GOLDEN
#> 7   7 GOELDEN GOLDEN
#> 8   8  GOLDEN GOLDEN

Created on 2020-04-20 by the reprex package (v0.3.0)

Otherwise, you can use the following function to clean/update factor levels, with similar syntax.


citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
                   "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                              "GALDEN","GELDON","GOELDEN","GOLDEN"),
                   stringsAsFactors = TRUE)  # make_levels() needs a factor; R >= 4.0 defaults to character

make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {

  lvls <- levels(.f)

  # Replacements can be listed in the replacement argument, taken as names in patterns, or the patterns themselves.
  if(is.null(replacement)) {
    if(is.null(names(patterns)))
      replacement <- patterns
    else
      replacement <- names(patterns)
  }

  # Find matching levels
  lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
  for(i in seq_along(patterns))
    lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)

  # Append other non-matching levels
  lvl_other <- setdiff(lvls, unlist(lvl_match))
  lvl_all <- append(
    lvl_match, 
    setNames(as.list(lvl_other), lvl_other)
  )

  return(lvl_all)

}

levels(citA$city) <- make_levels(citA$city, c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))

citA
#>   num   city
#> 1   1 BOERNE
#> 2   2 BOERNE
#> 3   3 BOERNE
#> 4   4 BOERNE
#> 5   5 GOLDEN
#> 6   6 GOLDEN
#> 7   7 GOLDEN
#> 8   8 GOLDEN

Created on 2020-04-20 by the reprex package (v0.3.0)
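The function above relies on a feature of base R: assigning a named list to levels() merges the old levels listed in each element into the new level given by that element's name. A minimal sketch with hand-written matches may make this easier to see; the value "AUSTIN" is a made-up example of a level that matches no pattern.

```r
f <- factor(c("BORNE", "BARNE", "GALDEN", "AUSTIN"))

# Each list name becomes a new level; its values are the old levels merged into it.
# Old levels not mentioned anywhere in the list would become NA,
# which is why the unmatched level is listed under its own name.
levels(f) <- list(BOERNE = c("BORNE", "BARNE"),
                  GOLDEN = "GALDEN",
                  AUSTIN = "AUSTIN")

as.character(f)
# [1] "BOERNE" "BOERNE" "GOLDEN" "AUSTIN"
```

This is the reason make_levels appends the non-matching levels before returning the list.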
