Can't reduce whitespaces, remove accent, and replace words in R
I have the following data frame:
structure(list(matching_var = c("BD Mymensingh", "CN Nei Mongol ", "EC Los Ríos", "MY Johor", "MY Kedah", "MY Kelantan", "MY Negeri Sembilan", "RU Amurskaja oblast")), row.names = c(44L,174L, 259L, 694L, 695L, 696L, 700L, 1029L), class = "data.frame")
I want it to become:
structure(list(matching_var = c("BD Mymensingh", "CN Nei Mongol", "EC Los Rios", "MY Johor", "MY Kedah", "MY Kelantan", "MY Negeri Sembilan", "RU Amur")), row.names = c(44L, 174L, 259L, 694L, 695L, 696L, 700L, 1029L), class = "data.frame")
I tried the following code, but it doesn't work:
library(dplyr)
library(stringr)
data <- data %>%
  mutate(matching_var = str_replace(matching_var, "MY Johor", "MY Johor")) %>%
  mutate(matching_var = str_replace(matching_var, "Los Ríos", "Los Rios")) %>%
  mutate(matching_var = str_replace(matching_var, "MY Kedah", "MY Kedah")) %>%
  mutate(matching_var = str_replace(matching_var, "MY Kelantan", "MY Kelantan ")) %>%
  mutate(matching_var = str_replace(matching_var, "Amurskaja oblast", "Amur"))
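I also tried this to deal with the whitespace, but still nothing:
data$matching_var <- str_replace_all(data$matching_var, fixed("  "), " ")
The output looks exactly the same as the input, and I don't understand why. Thanks for your help.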
If you are replacing with fixed strings, one option is to merge/left_join a conversion table:
conversions <- tribble(
  ~fm,                ~to,
  "MY Johor",         "MY Johor",
  "Los Ríos",         "Los Rios",
  "MY Kedah",         "MY Kedah",
  "MY Kelantan",      "MY Kelantan ",
  "Amurskaja oblast", "Amur"
)
left_join(data, conversions, by = c("matching_var" = "fm")) %>%
  mutate(
    new_matching_var = coalesce(to, matching_var)
  ) %>%
  select(-to)
#          matching_var    new_matching_var
# 1       BD Mymensingh       BD Mymensingh
# 2       CN Nei Mongol       CN Nei Mongol
# 3         EC Los Ríos         EC Los Ríos
# 4            MY Johor            MY Johor
# 5            MY Kedah            MY Kedah
# 6         MY Kelantan         MY Kelantan
# 7  MY Negeri Sembilan  MY Negeri Sembilan
# 8 RU Amurskaja oblast RU Amurskaja oblast
(Note that "MY Kelantan " has a trailing space, which you added in your own code.)
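One caveat with a lookup table: a join (or match) compares whole values, so a key such as "Los Ríos" only applies when the column holds exactly that string, not "EC Los Ríos". The base-R sketch below uses two made-up vectors, `values` and `keys` (illustrative names, not part of the code above), to contrast whole-value matching with substring matching.

```r
# Illustrative inputs: two column values and two lookup keys.
values <- c("EC Los Ríos", "MY Kedah")
keys   <- c("Los Ríos", "MY Kedah")

# match() compares whole strings, so only the second value finds its key.
whole <- match(values, keys)

# grepl() with fixed = TRUE looks for the literal text anywhere inside each value.
partial <- grepl("Los Ríos", values, fixed = TRUE)

print(whole)    # NA for the first value, 2 for the second
print(partial)  # TRUE for the first value, FALSE for the second
```

If the keys are fragments rather than full values, either store the full value (country prefix included) in the table or use a substring replacement instead.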
Another option is to use conversions with match, without a merge/join:
data %>%
  mutate(
    ind = match(matching_var, conversions$fm),
    new_matching_var = if_else(is.na(ind),
                               matching_var, conversions$to[ind])
  ) %>%
  select(-ind)
#             matching_var    new_matching_var
# 44         BD Mymensingh       BD Mymensingh
# 174        CN Nei Mongol       CN Nei Mongol
# 259          EC Los Ríos         EC Los Ríos
# 694             MY Johor            MY Johor
# 695             MY Kedah            MY Kedah
# 696          MY Kelantan         MY Kelantan
# 700   MY Negeri Sembilan  MY Negeri Sembilan
# 1029 RU Amurskaja oblast RU Amurskaja oblast
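While this can be done with a named vector, one reason I recommend a conversion table is that it is easy to maintain (e.g., as a CSV in a directory; you can use Excel or Calc to edit/maintain your list of desired conversions).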
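For comparison, here is a minimal sketch of the named-vector variant, using substring replacement so the country prefix does not have to be part of the key. The `lookup` vector and the sample `x` are assumptions for illustration, and the loop uses base R's gsub() with fixed = TRUE rather than anything from the code above.

```r
x <- c("EC Los Ríos", "RU Amurskaja oblast", "MY Kedah")

# Assumed lookup: names are the text to find, values are the replacements.
lookup <- c("Los Ríos" = "Los Rios", "Amurskaja oblast" = "Amur")

# Apply each replacement in turn; fixed = TRUE treats the names as literal text.
cleaned <- x
for (i in seq_along(lookup)) {
  cleaned <- gsub(names(lookup)[i], lookup[[i]], cleaned, fixed = TRUE)
}

print(cleaned)
```

stringr::str_replace_all() also accepts a named vector of pattern = replacement pairs, though it treats the names as regular expressions by default.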
To remove whitespace, you can use trimws to strip extra spaces from the beginning or end of a string. I would continue with:
data %>%
  mutate(
    ind = match(matching_var, conversions$fm),
    new_matching_var = if_else(is.na(ind),
                               matching_var, conversions$to[ind])
  ) %>%
  select(-ind) %>%
  mutate(new_matching_var2 = trimws(new_matching_var))
#             matching_var    new_matching_var   new_matching_var2
# 44         BD Mymensingh       BD Mymensingh       BD Mymensingh
# 174        CN Nei Mongol       CN Nei Mongol       CN Nei Mongol
# 259          EC Los Ríos         EC Los Ríos         EC Los Ríos
# 694             MY Johor            MY Johor            MY Johor
# 695             MY Kedah            MY Kedah            MY Kedah
# 696          MY Kelantan         MY Kelantan         MY Kelantan
# 700   MY Negeri Sembilan  MY Negeri Sembilan  MY Negeri Sembilan
# 1029 RU Amurskaja oblast RU Amurskaja oblast RU Amurskaja oblast
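While I don't know what isn't working with str_replace_all(data$matching_var, fixed("  "), " "), it does remove every literal mid-string "  " that I can see (well, at this point there are no double spaces remaining, but it would have).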
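Replacing a literal double space with a single space only partly collapses longer runs (three spaces become two). A regular expression that matches any run of whitespace handles every case in one pass. A small base-R sketch, with a made-up input vector:

```r
x <- c("CN Nei Mongol ", "MY  Johor", "  MY   Kedah")

# Collapse every run of whitespace to one space, then trim both ends.
squished <- trimws(gsub("\\s+", " ", x))

print(squished)
```

stringr::str_squish() performs the same collapse-and-trim in a single call, if you prefer to stay within stringr.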
You can replace all of this with a simpler two-step cleanup, assuming you don't need to change the values in any other way:
data %>%
  mutate(
    # replace every literal double space with a single one, then trim both ends
    new_matching_var = trimws(str_replace_all(matching_var, fixed("  "), " "))
  )
#             matching_var    new_matching_var
# 44         BD Mymensingh       BD Mymensingh
# 174        CN Nei Mongol       CN Nei Mongol
# 259          EC Los Ríos         EC Los Ríos
# 694             MY Johor            MY Johor
# 695             MY Kedah            MY Kedah
# 696          MY Kelantan         MY Kelantan
# 700   MY Negeri Sembilan  MY Negeri Sembilan
# 1029 RU Amurskaja oblast RU Amurskaja oblast
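To get all the way to the desired output, the whitespace cleanup has to be combined with the accent and word replacements. A rough base-R sketch, assuming only the single accented letter and the single rename that appear in the sample data; `clean_region` is a hypothetical helper name, and the data frame is rebuilt here without the original row names to keep the example self-contained.

```r
data <- data.frame(
  matching_var = c("BD Mymensingh", "CN Nei Mongol ", "EC Los Ríos", "MY Johor",
                   "MY Kedah", "MY Kelantan", "MY Negeri Sembilan",
                   "RU Amurskaja oblast"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: squish whitespace, drop the accent, shorten the oblast name.
clean_region <- function(x) {
  x <- trimws(gsub("\\s+", " ", x))      # collapse whitespace runs and trim the ends
  x <- gsub("í", "i", x, fixed = TRUE)   # the only accented letter in the sample
  gsub("Amurskaja oblast", "Amur", x, fixed = TRUE)
}

data$matching_var <- clean_region(data$matching_var)
print(data$matching_var)
```

For data with many different accented letters, stringi::stri_trans_general(x, "Latin-ASCII") is a more general option than listing characters one by one.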