Can't reduce whitespace, remove accents, and replace words in R

I have the following data frame:

structure(list(matching_var = c("BD Mymensingh", "CN Nei Mongol ", "EC Los Ríos", "MY  Johor", "MY  Kedah", "MY  Kelantan", "MY Negeri Sembilan", "RU Amurskaja oblast")), row.names = c(44L,174L, 259L, 694L, 695L, 696L, 700L, 1029L), class = "data.frame")

I would like it to become:

structure(list(matching_var = c("BD Mymensingh", "CN Nei Mongol", "EC Los Rios", "MY Johor", "MY Kedah", "MY Kelantan", "MY Negeri Sembilan", "RU Amur")), row.names = c(44L, 174L, 259L, 694L, 695L, 696L, 700L, 1029L), class = "data.frame")

I tried the following code, but it doesn't work:

library(dplyr)
library(stringr)

data <- data %>%
  mutate(matching_var = str_replace(matching_var, "MY  Johor", "MY Johor")) %>% 
  mutate(matching_var = str_replace(matching_var, "Los Ríos", "Los Rios"))%>% 
  mutate(matching_var = str_replace(matching_var, "MY  Kedah", "MY Kedah"))%>% 
  mutate(matching_var = str_replace(matching_var, "MY  Kelantan", "MY Kelantan "))%>% 
  mutate(matching_var = str_replace(matching_var, "Amurskaja oblast", "Amur"))

I also tried this to deal with the whitespace, but still no luck:

data$matching_var <-str_replace_all(data$matching_var, fixed("  "), " ")

The output looks exactly the same as the input, and I don't understand why. Thanks for your help.

  1. If you want to replace with fixed strings, one option is to merge/left_join a conversion table:

    conversions <- tribble(
    ~fm,                ~to,
    "MY  Johor"        , "MY Johor",
    "Los Ríos"         , "Los Rios",
    "MY  Kedah"        , "MY Kedah",
    "MY  Kelantan"     , "MY Kelantan ",
    "Amurskaja oblast" , "Amur")
    
    left_join(data, conversions, by = c("matching_var" = "fm")) %>%
      mutate(
        new_matching_var = coalesce(to, matching_var)
      ) %>%
      select(-to)
    #          matching_var    new_matching_var
    # 1       BD Mymensingh       BD Mymensingh
    # 2      CN Nei Mongol       CN Nei Mongol 
    # 3         EC Los Ríos         EC Los Ríos
    # 4           MY  Johor            MY Johor
    # 5           MY  Kedah            MY Kedah
    # 6        MY  Kelantan        MY Kelantan 
    # 7  MY Negeri Sembilan  MY Negeri Sembilan
    # 8 RU Amurskaja oblast RU Amurskaja oblast
    

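    (Note that "MY Kelantan " has a trailing space, which you included in your own code. Also, because the join compares whole values, the "Los Ríos" and "Amurskaja oblast" entries match nothing here; fm would need to hold the full strings, such as "EC Los Ríos".)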

  2. Another option is to use match() against conversions, without a merge/join:

    data %>%
      mutate(
        ind = match(matching_var, conversions$fm), 
        new_matching_var = if_else(is.na(ind), matching_var, conversions$to[ind])
      ) %>%
      select(-ind)
    #             matching_var    new_matching_var
    # 44         BD Mymensingh       BD Mymensingh
    # 174       CN Nei Mongol       CN Nei Mongol 
    # 259          EC Los Ríos         EC Los Ríos
    # 694            MY  Johor            MY Johor
    # 695            MY  Kedah            MY Kedah
    # 696         MY  Kelantan        MY Kelantan 
    # 700   MY Negeri Sembilan  MY Negeri Sembilan
    # 1029 RU Amurskaja oblast RU Amurskaja oblast
    

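    While this could be done with a named vector, one reason I recommend the table is that it is easy to maintain (for example, as a CSV in the directory; you can use Excel or Calc to edit/maintain the list of conversions you want).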

  3. To remove whitespace, you can use trimws, which strips extra spaces from the beginning or end of a string. Continuing from the previous step:

    data %>%
      mutate(
        ind = match(matching_var, conversions$fm), 
        new_matching_var = if_else(is.na(ind), matching_var, conversions$to[ind])
      ) %>%
      select(-ind) %>%
      mutate(new_matching_var2 = trimws(new_matching_var))
    #             matching_var    new_matching_var   new_matching_var2
    # 44         BD Mymensingh       BD Mymensingh       BD Mymensingh
    # 174       CN Nei Mongol       CN Nei Mongol        CN Nei Mongol
    # 259          EC Los Ríos         EC Los Ríos         EC Los Ríos
    # 694            MY  Johor            MY Johor            MY Johor
    # 695            MY  Kedah            MY Kedah            MY Kedah
    # 696         MY  Kelantan        MY Kelantan          MY Kelantan
    # 700   MY Negeri Sembilan  MY Negeri Sembilan  MY Negeri Sembilan
    # 1029 RU Amurskaja oblast RU Amurskaja oblast RU Amurskaja oblast
    

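    While I don't know what isn't working with str_replace_all(data$matching_var, fixed("  "), " "), it does remove every literal mid-string "  " that I see (well, at this point there are no double spaces left, but it would have).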
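
If the replacement really does return the input unchanged on your machine, one thing that might be worth checking is whether every "space" is an ordinary U+0020 space rather than, say, a non-breaking space (U+00A0). This is purely an assumption on my part; nothing in the question confirms it. A minimal sketch of how one might check, using a made-up value x rather than your data:

```r
library(stringr)

# Hypothetical value: looks like "MY  Johor", but the first "space" is U+00A0
x <- "MY\u00a0 Johor"

# Code points reveal the difference: 32 is an ordinary space, 160 is a non-breaking space
utf8ToInt(x)

# A literal two-space pattern does not match here, so the string comes back unchanged
unchanged <- str_replace_all(x, fixed("  "), " ")

# The regex class \s (ICU) also covers Unicode space separators, so this collapses the run
fixed_up <- str_replace_all(x, "\\s+", " ")
```

If utf8ToInt() shows only 32 for the spaces in your real values, this is not the cause and the problem lies elsewhere.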

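Assuming you don't need to change the values in any other way, you can replace all of the numbered steps above with a simpler two-step cleanup: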

data %>%
  mutate(
    # str_replace_all() replaces every double space, not only the first one in each string
    new_matching_var = trimws(str_replace_all(matching_var, fixed("  "), " "))
  )
#             matching_var    new_matching_var
# 44         BD Mymensingh       BD Mymensingh
# 174       CN Nei Mongol        CN Nei Mongol
# 259          EC Los Ríos         EC Los Ríos
# 694            MY  Johor            MY Johor
# 695            MY  Kedah            MY Kedah
# 696         MY  Kelantan         MY Kelantan
# 700   MY Negeri Sembilan  MY Negeri Sembilan
# 1029 RU Amurskaja oblast RU Amurskaja oblast
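
The cleanup above leaves the accent in "Los Ríos" and the long form "Amurskaja oblast" untouched, both of which the question wants changed. Here is a minimal sketch of one way to cover those as well. It assumes the stringi package is installed (stringr depends on it). str_squish(), stri_trans_general() with the "Latin-ASCII" transform, and the named-vector form of str_replace_all() are not in the code above, and renames is a hypothetical name:

```r
library(dplyr)
library(stringr)
library(stringi)

# Same values as the data frame in the question (the accent is written as a Unicode escape)
data <- data.frame(
  matching_var = c("BD Mymensingh", "CN Nei Mongol ", "EC Los R\u00edos", "MY  Johor",
                   "MY  Kedah", "MY  Kelantan", "MY Negeri Sembilan", "RU Amurskaja oblast"),
  row.names = c(44L, 174L, 259L, 694L, 695L, 696L, 700L, 1029L)
)

# Hypothetical lookup of substring renames: names are regex patterns, values are replacements
renames <- c("Amurskaja oblast" = "Amur")

cleaned <- data %>%
  mutate(
    matching_var = matching_var %>%
      str_squish() %>%                        # trim both ends, collapse runs of whitespace
      stri_trans_general("Latin-ASCII") %>%   # transliterate accented letters to ASCII
      str_replace_all(renames)                # apply the substring renames
  )

cleaned$matching_var
# [1] "BD Mymensingh"      "CN Nei Mongol"      "EC Los Rios"        "MY Johor"
# [5] "MY Kedah"           "MY Kelantan"        "MY Negeri Sembilan" "RU Amur"
```

Unlike the join and match() approaches, the named vector matches substrings, so "Amurskaja oblast" is replaced even though the full value starts with "RU ". The names are treated as regular expressions, so any special characters in them would need escaping.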