How can I join data using a fuzzy match in R?
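I have some subject and licensure data and would like to create a column that flags whether the licensure is appropriate for the subject listed. The added challenge is that some teachers teach multiple subjects, separated by semicolons, and each licensure has several acceptable subjects.
I think I need to incorporate something like grep, but I'm not sure how to join the data from the two tables while adding that functionality.
Sample code
Here's an excerpt of my data frame: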
df1 <- data.frame(Subject = c("Spanish Language Arts; I teach all subjects for my students",
"Math; Science", "Mathematics; ELA", "ELA", "Science;Math;English Language Arts",
"Spanish Language Arts; I teach all subjects for my students",
"Math", "Science;Social Studies;Mathematics;English Language Arts", "ELA",
"English Language Arts"),
Licensure = c("Content Area - Early Childhood (preK-Grade 3)",
"Core Subjects (Grades EC-6) 1770", "Mathematics (Grades 7-12) 1706",
"English Language Arts and Reading (Grades 7-12) 1709", "Core Subjects (Grades EC-6) 1770",
"English Language Arts and Reading (Grades 7-12) 1709",
"English Language Arts and Reading (Grades 7-12) 1709",
"Content Area - Elementary Education (Grades 1-6)",
"Mathematics (Grades 7-12) 1706", "Content Area - Elementary Education (Grades 1-6)"))
Here is a list I've created that includes all of the licensures, with the acceptable subjects listed under each:
lic.subject_index <- list(
"Content Area - Early Childhood (preK-Grade 3)" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
"Content Area - Elementary Education (Grades 1-6)" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
"Core Subjects (Grades EC-6) 1770" = c("I teach all subjects for my students", "Math", "Mathematics", "ELA", "English Language Arts", "Language Arts"),
"English Language Arts and Reading (Grades 7-12) 1709" = c("ELA", "English Language Arts", "Language Arts"),
"Mathematics (Grades 7-12) 1706" = c("Math", "Mathematics")
)
What I would like to do is create a column that flags whether the subject/license combination is acceptable:
ideal.df <- data.frame(Subject = c("Spanish Language Arts; I teach all subjects for my students",
"Math; Science", "Mathematics; ELA", "ELA", "Science;Math;English Language Arts",
"Spanish Language Arts; I teach all subjects for my students", "Math",
"Science;Social Studies;Mathematics;English Language Arts", "ELA", "English Language Arts"),
Licensure = c("Content Area - Early Childhood (preK-Grade 3)", "Core Subjects (Grades EC-6) 1770",
"Mathematics (Grades 7-12) 1706", "English Language Arts and Reading (Grades 7-12) 1709",
"Core Subjects (Grades EC-6) 1770", "English Language Arts and Reading (Grades 7-12) 1709",
"English Language Arts and Reading (Grades 7-12) 1709", "Content Area - Elementary Education (Grades 1-6)",
"Mathematics (Grades 7-12) 1706", "Content Area - Elementary Education (Grades 1-6)"),
flag = c("True", "True", "True", "True", "True", "False", "False", "True", "False", "True"))
Thanks in advance for any help you can provide!
Here is an option with tidyverse and fuzzyjoin:
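library(fuzzyjoin)
library(tidyverse)

# Long lookup table: one row per (Licensure, acceptable Subject) pair
lic_long <- enframe(lic.subject_index, name = 'Licensure', value = 'Subject') %>%
  unnest(cols = Subject)

out <- df1 %>%
  rownames_to_column('rn') %>%
  # one row per subject; '\\s*' also drops the space after each semicolon
  separate_rows(Subject, sep = ';\\s*') %>%
  # fuzzy-match on both columns; unmatched rows get NA in the .y columns
  stringdist_left_join(lic_long, by = c('Subject', 'Licensure')) %>%
  group_by(rn = as.integer(rn)) %>%
  # a row is flagged if any of its subjects matched its licensure
  summarise(ind = any(!is.na(Licensure.y))) %>%
  pull(ind) %>%
  mutate(df1, flag = .)
out$flag
#[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE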
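To see why a near-duplicate such as " Math" can still match while "Spanish Language Arts" does not match "English Language Arts", here is a rough base R illustration of a distance threshold. It uses adist() (generalized Levenshtein distance), which is similar to but not necessarily identical to the distance fuzzyjoin computes, and it assumes that the threshold in play is the max_dist = 2 default of stringdist_left_join(). The pairs data frame is made up for illustration.

```r
# Illustration only: base R edit distance, not fuzzyjoin's own computation.
# max_dist = 2 is assumed to be the stringdist_left_join() default.
max_dist <- 2

pairs <- data.frame(
  left  = c(" Math", "Mathematics", "Spanish Language Arts", "ELA"),
  right = c("Math",  "Mathematics", "English Language Arts", "Math"),
  stringsAsFactors = FALSE
)

# adist() returns the matrix of all pairwise distances;
# diag() keeps only the element-wise (row i vs. row i) comparisons
pairs$dist  <- diag(adist(pairs$left, pairs$right))
pairs$match <- pairs$dist <= max_dist
pairs
```

The first two pairs fall within the threshold and the last two do not, which is consistent with the FALSE flags above. Short strings such as "ELA" leave little room for error, so a threshold of 2 may be too generous for other data.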
Checking against the OP's ideal output:
as.logical(ideal.df$flag)
#[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE
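If the subject strings are spelled consistently, fuzzy matching may not be needed at all. As a rough alternative (not the approach above), here is a minimal base R sketch that splits each Subject on semicolons, trims whitespace, and checks for an exact match in the licensure's list. The helper name flag_subjects and the demo objects are hypothetical and use only a subset of the question's values.

```r
# Hypothetical helper: TRUE when at least one listed subject is accepted
# by the row's licensure (exact match after trimming whitespace).
flag_subjects <- function(subject, licensure, index) {
  # "Math; Science" -> c("Math", " Science")
  subjects <- strsplit(as.character(subject), ";", fixed = TRUE)
  mapply(
    function(subj, lic) any(trimws(subj) %in% index[[lic]]),
    subjects, as.character(licensure),
    USE.NAMES = FALSE
  )
}

# Small illustrative data taken from the question
demo_index <- list(
  "Core Subjects (Grades EC-6) 1770" =
    c("I teach all subjects for my students", "Math", "Mathematics",
      "ELA", "English Language Arts", "Language Arts"),
  "English Language Arts and Reading (Grades 7-12) 1709" =
    c("ELA", "English Language Arts", "Language Arts")
)
demo <- data.frame(
  Subject = c("Math; Science", "Math", "Science;Math;English Language Arts"),
  Licensure = c("Core Subjects (Grades EC-6) 1770",
                "English Language Arts and Reading (Grades 7-12) 1709",
                "Core Subjects (Grades EC-6) 1770"),
  stringsAsFactors = FALSE
)

demo$flag <- flag_subjects(demo$Subject, demo$Licensure, demo_index)
demo$flag
#[1]  TRUE FALSE  TRUE
```

A licensure that is missing from the index yields FALSE here rather than an error, which may or may not be what you want.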