R - Fuzzy find and recode
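I am cleaning demographic data submitted by 10+ school districts, but the submissions are not standardized. I want to find patterns and recode the values so that the data are clean and simple.
Say I have a variable called Race, and one of its categories is Native Hawaiian - Pacific Islander.
School A submits this category as Native Hawaiian or Other Pacific Islander. School B submits it as Native Hawaiian/Pacific Islander. School C submits it as Native Hawaiian or Pacific Islander.
How can I recode so that, whenever R sees the word Pacific anywhere in the variable, the value is recoded to Native Hawaiian - Pacific Islander?
Here is the original data: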
df_original <- data.frame(Race=c("Native Hawaiian or Other Pacific Islander",
"Native Hawaiian/Pacific Islander", "Native Hawaiian or Pacific Islander",
"Black or African American", "Black", "Black/African American"))
Here is the desired cleaned data:
df_desired <- data.frame(Race=c("Native Hawaiian - Pacific Islander","Native Hawaiian - Pacific Islander",
"Native Hawaiian - Pacific Islander","Black - African American",
"Black - African American","Black - African American"))
grepl() returns TRUE for strings that contain "Pacific" and FALSE otherwise. Use it to subset your vector and replace the matching elements with the string you want:
df_original$Race[grepl("Pacific", df_original$Race)] <- "Native Hawaiian - Pacific Islander"
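The same idea can be extended to several categories at once. The following is a minimal sketch, not part of the original answer: recode_map is a hypothetical named vector (pattern as name, clean label as value), and ignore.case = TRUE is an assumption that capitalization may vary between districts. stringsAsFactors = FALSE is set explicitly because in R versions before 4.0 the column would otherwise be a factor, and assigning a new label to a factor yields NA.

```r
# Example data, copied from the question.
df_original <- data.frame(
  Race = c("Native Hawaiian or Other Pacific Islander",
           "Native Hawaiian/Pacific Islander",
           "Native Hawaiian or Pacific Islander",
           "Black or African American", "Black", "Black/African American"),
  stringsAsFactors = FALSE  # keep Race as character (matters in R < 4.0)
)

# Hypothetical lookup: names are regex patterns, values are the clean labels.
recode_map <- c("Pacific" = "Native Hawaiian - Pacific Islander",
                "Black"   = "Black - African American")

# Match against a copy of the raw column so that a label written by one
# pattern cannot be re-matched by a later pattern.
raw <- df_original$Race
for (pattern in names(recode_map)) {
  hits <- grepl(pattern, raw, ignore.case = TRUE)
  df_original$Race[hits] <- recode_map[[pattern]]
}
```

Values that match no pattern are left exactly as submitted, which makes leftover variants easy to spot with unique(df_original$Race).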
Use str_detect together with case_when:
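library(dplyr)
library(stringr)
df_original %>%
  # "\\b" is a regex word boundary; a single "\b" in an R string is a backspace
  mutate(Race2 = case_when(str_detect(Race, '\\bPacific\\b') ~
                             "Native Hawaiian - Pacific Islander",
                           # every other value is treated as Black - African American
                           TRUE ~ "Black - African American"))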
-output
                                       Race                              Race2
1 Native Hawaiian or Other Pacific Islander Native Hawaiian - Pacific Islander
2          Native Hawaiian/Pacific Islander Native Hawaiian - Pacific Islander
3       Native Hawaiian or Pacific Islander Native Hawaiian - Pacific Islander
4                 Black or African American           Black - African American
5                                     Black           Black - African American
6                    Black/African American           Black - African American
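Note that a catch-all TRUE branch assigns "Black - African American" to every value that does not contain "Pacific", which only works while the data hold exactly two categories. A hedged variant is sketched below: each category gets its own pattern, and unmatched values fall through unchanged. The data frame df_extra and its "White" entry are hypothetical, added only to show the fall-through.

```r
library(dplyr)
library(stringr)

# Hypothetical data with a value that matches neither pattern.
df_extra <- data.frame(
  Race = c("Native Hawaiian/Pacific Islander", "Black", "White"),
  stringsAsFactors = FALSE
)

df_clean <- df_extra %>%
  mutate(Race2 = case_when(
    str_detect(Race, "\\bPacific\\b") ~ "Native Hawaiian - Pacific Islander",
    str_detect(Race, "\\bBlack\\b")   ~ "Black - African American",
    TRUE                              ~ Race  # no pattern matched: keep as submitted
  ))
```

Conditions are evaluated in order and the first match wins, so more specific patterns should come first.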
Another option is to create a key/value dataset containing the patterns to replace and their corresponding values, and then do a regex_left_join (from fuzzyjoin) with the original data:
library(dplyr)      # provides tibble() and %>%
library(fuzzyjoin)
keydat <- tibble(Race = c("Pacific", "Black"),
                 Race2 = c("Native Hawaiian - Pacific Islander", "Black - African American"))
regex_left_join(df_original, keydat) %>%
  transmute(Race = Race2)
#Joining by: "Race"
# Race
#1 Native Hawaiian - Pacific Islander
#2 Native Hawaiian - Pacific Islander
#3 Native Hawaiian - Pacific Islander
#4 Black - African American
#5 Black - African American
#6 Black - African American
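Because this is a left join, a row whose value matches none of the patterns comes back with NA in Race2. The sketch below is one possible way to handle that, under a few assumptions: the key column is renamed to a hypothetical pattern so the join needs no column suffixes, df_extra with its "White" entry is made-up data, and coalesce() from dplyr falls back to the submitted value.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical data with one value that no pattern covers.
df_extra <- data.frame(
  Race = c("Native Hawaiian/Pacific Islander", "Black", "White"),
  stringsAsFactors = FALSE
)

# Key/value table; the column name `pattern` is an illustrative choice.
keydat <- tibble(
  pattern = c("Pacific", "Black"),
  Race2   = c("Native Hawaiian - Pacific Islander", "Black - African American")
)

result <- regex_left_join(df_extra, keydat, by = c("Race" = "pattern")) %>%
  mutate(Race = coalesce(Race2, Race)) %>%  # unmatched rows keep the original value
  select(Race)
```

A value that matches more than one pattern produces one output row per match, so it is worth comparing nrow(result) with nrow(df_extra) after the join.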