Match a string with the same string from another data frame
I have this data frame (DF1):
structure(list(ID = 1:3, Text = c("there was not clostridium", "clostridium difficile positive", "test was OK")), class = "data.frame", row.names = c(NA, -3L))
ID TEXT
1 "there was not clostridium"
2 "clostridium difficile positive"
3 "test was OK"
and this data frame (DF2):
structure(list(ID = 1:3, Microorganisms = c("ESCHERICHIA COLI", "CLOSTRIDIUM DIFFICILE", "FUNGI")), class = "data.frame", row.names = c(NA, -3L))
ID Microorganisms
1 ESCHERICHIA COLI
2 CLOSTRIDIUM DIFFICILE
3 FUNGI
I want to use a regular expression to find the matches between DF1 and DF2 and put them in a new column, like this:
ID TEXT Microorganism
1 "there was not clostridium" CLOSTRIDIUM DIFFICILE
2 "clostridium difficile positive" CLOSTRIDIUM DIFFICILE
3 "test was OK" no
I tried something like this:
DF1 %>% mutate(Mikroorganism = ifelse(grepl(DF2$Microorganisms, TEXT), str_extract(TEXT, DF2$Microorganisms), "no"))
but it does not work.
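One approach is to use the fuzzyjoin package. Each organism name is turned into an alternation pattern (for example "CLOSTRIDIUM|DIFFICILE"), so a text matches if it contains any word of the name, ignoring case.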
library(dplyr)

DF1 %>%
  fuzzyjoin::regex_left_join(
    # "\\s+" (escaped backslash) is required; "\s+" is not a valid R string
    transmute(DF2, Microorganisms, ptn = gsub("\\s+", "|", Microorganisms)),
    by = c("Text" = "ptn"), ignore_case = TRUE) %>%
  select(-ptn)
# ID Text Microorganisms
# 1 1 there was not clostridium CLOSTRIDIUM DIFFICILE
# 2 2 clostridium difficile positive CLOSTRIDIUM DIFFICILE
# 3 3 test was OK <NA>
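If fuzzyjoin is not available, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: it assumes that when several organisms match a text the first one in DF2 should win, and it fills unmatched rows with "no" as in the desired output. The helper name `ptn` is reused from the answer, and `Microorganism` is the new column name taken from the question.

```r
DF1 <- data.frame(
  ID = 1:3,
  Text = c("there was not clostridium", "clostridium difficile positive", "test was OK"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  ID = 1:3,
  Microorganisms = c("ESCHERICHIA COLI", "CLOSTRIDIUM DIFFICILE", "FUNGI"),
  stringsAsFactors = FALSE
)

# One alternation pattern per organism, e.g. "CLOSTRIDIUM|DIFFICILE"
ptn <- gsub("\\s+", "|", DF2$Microorganisms)

# For each text, take the first organism whose pattern matches, else "no"
# (assumption: first match wins when several organisms match)
DF1$Microorganism <- vapply(DF1$Text, function(txt) {
  hit <- which(vapply(ptn, grepl, logical(1), x = txt, ignore.case = TRUE))
  if (length(hit) > 0) DF2$Microorganisms[hit[1]] else "no"
}, character(1), USE.NAMES = FALSE)

DF1
```

Note that the patterns match substrings, so "COLI" would also match a text containing "colitis"; wrapping each word in `\\b` word boundaries would tighten this if needed.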