Fuzzy version of stringr::str_detect for filtering a dataframe
I have a database with a free-text field that I want to use to filter a data.frame or tibble. With a lot of work I could probably build a list of all likely misspellings of my search term as they currently appear in the data (see below for an example of all the spellings of one term) and then use stringr::str_detect, as in the example code below. However, that is not safe against new misspellings turning up in the future. Is there a simple way to do a fuzzy version of str_detect, if I am willing to accept some limitations / make some assumptions (e.g. about how large the edit distance between misspellings can be, that people will not use an entirely different term, etc.)?
As far as I can tell, the obvious packages such as stringdist do not seem to have a function that does this directly. I suppose I could write my own function that applies something like stringdist::afind or stringdist::amatch to each element of a vector and then post-processes the result into a single TRUE or FALSE, but I wonder whether such a function already exists somewhere (and is implemented more efficiently than I would manage).
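For illustration, a rough sketch of what I have in mind (the helper name fuzzy_detect and the exact use of stringdist::afind are just assumptions about how it could look):

library(stringdist)

# Hypothetical helper: TRUE if the best window of x is within max_dist of the pattern
fuzzy_detect <- function(x, pattern, max_dist = 2) {
  stringdist::afind(tolower(x), tolower(pattern), method = "dl")$distance[, 1] <= max_dist
}

# e.g. example_data %>% filter(fuzzy_detect(disease_phase, "preclinical"))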
Here is an example of how I might miss a row I want when using str_detect:
library(tidyverse)
search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial",
"Precllinical", "Preclilnical", "Preclinica", "Preclnical",
"Peclinical", "Prclinical", "Peeclinical", "Pre clinical",
"Precclinical", "Preclicnial", "Precliical", "Precliinical",
"Preclinal", "Preclincail", "Preclinicgal", "Priclinical")
example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
"Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30"))
# Finds only project A111, but not A123
example_data %>%
filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))
The Damerau–Levenshtein distance is a good choice for measuring string distance when it comes to typos. In the code below I split disease_phase into substrings and check whether any of them closely matches "preclinical".
library(tidyverse)
library(stringdist)
example_data |>
filter(str_split(disease_phase, "\\W+") |>
map_lgl(\(x) x |>
stringdist("preclinical", "dl") |>
(`<=`)(4) |> # Threshold for distance
any()
)
)
#> # A tibble: 2 × 3
#> project disease_phase startdate
#> <chr> <chr> <chr>
#> 1 A111 Diabetes, Preclinical 01DEC2018
#> 2 A123 Lipid lowering, Perlcinical 17-OKT-2017
I chose a fairly conservative distance threshold of <= 4 because, as you can see below, your typo examples all stay under that value. You may need to do some testing to find a good threshold.
stringdist(search_terms, "preclinical")
#> [1] 0 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2
Created on 2022-04-23 by the reprex package (v2.0.1)
Edit: As stated in my comment on JBGruber's answer, going long instead of nesting brings a significant performance benefit, so it is better to do this:
example_large |>
tidytext::unnest_tokens(word, disease_phase, drop = F) |>
mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |>
filter(str_dist < 4) |>
group_by(project, disease_phase) |>
slice(which.min(str_dist))
当 "preclinical"
在同一字符串中出现两次时,最后两行是为了避免潜在的重复,样本数据中没有出现这种情况,但在大型人工生成的数据集中并非不可能。
You can use agrep/agrepl for approximate string matching (fuzzy matching); they come with base R.
example_data[agrep(paste(search_terms, collapse = "|"),
example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
# project disease_phase startdate
#1 A111 Diabetes, Preclinical 01DEC2018
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
Or use Reduce instead of | in the regular expression:
example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
ignore.case=TRUE), search_terms, FALSE),]
# project disease_phase startdate
#1 A111 Diabetes, Preclinical 01DEC2018
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
An alternative could be adist, also in base R, which computes a distance matrix and is therefore probably not recommended for larger vectors, since the matrix can get large. Here I again allow a mismatch of 2 characters:
example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
partial=TRUE) < 3) > 0,]
# project disease_phase startdate
#1 A111 Diabetes, Preclinical 01DEC2018
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
It might be more efficient to compare only single words, so here disease_phase is also split into words using strsplit from base R:
. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
# tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
# project disease_phase startdate
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
#1 A111 Diabetes, Preclinical 01DEC2018
Some simpler examples using agrepl:
# Allow 1 character difference for a match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1] TRUE TRUE FALSE
# Allow 2 character differences for a match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE
# Use boundaries to match whole words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1] TRUE FALSE
The allowed difference can be set with max.distance:
max.distance: Maximum distance allowed for a match. Expressed either
as integer, or as a fraction of the _pattern_ length times
the maximal transformation cost (will be replaced by the
smallest integer not less than the corresponding fraction),
or a list with possible components
‘cost’: maximum number/fraction of match cost (generalized
Levenshtein distance)
‘all’: maximal number/fraction of _all_ transformations
(insertions, deletions and substitutions)
‘insertions’: maximum number/fraction of insertions
‘deletions’: maximum number/fraction of deletions
‘substitutions’: maximum number/fraction of substitutions
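For example, a few illustrative calls (the values are chosen only for demonstration; see ?agrep for the exact semantics):

# max.distance as an absolute number of edits
agrepl("preclinical", "prelcinical", max.distance = 2)
# as a fraction of the pattern length
agrepl("preclinical", "prelcinical", max.distance = 0.2)
# or as a list limiting the total and specific transformation types
agrepl("preclinical", "prelcinical", max.distance = list(all = 2, substitutions = 2))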
And a benchmark based on @JBGruber's answer:
system.time({ #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
# User System verstrichen
# 1.008 0.040 1.046
set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)
stringdist_detect <- function(a, b, method = "osa", thres = 2) {
Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}
bench::mark(check = FALSE,
stringdist_detect = {
example_large %>%
tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
},
GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 stringdist_detect 17.42ms 18.65ms 52.8 7.15MB 19.4 19 7
#2 GKi 5.64ms 6.04ms 165. 869.08KB 6.27 79 3
It also saves a lot of time if search_terms contains only one, correctly spelled, variant of the word of interest.
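For illustration, a minimal sketch assuming only the correctly spelled term is searched (the threshold of 3 is just an example and may need tuning, since e.g. a transposition costs two edits here):

# Hypothetical single-term variant
example_data[agrepl("preclinical", example_data$disease_phase, 3, ignore.case = TRUE), ]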
I think the most efficient/fastest way to do this is the following:
stringdist_detect <- function(a, b, method = "osa", thres = 2) {
Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}
stringdist::stringdistmatrix computes a distance matrix between all values in a and b. I had never heard of Rfast::rowMins, but a bit of googling tells me it is the fastest way to find the minimum in each row of a matrix (apply(x, 1, min) would accomplish the same thing). The minimum is all we want here, since it tells us the smallest distance between a word in a and the words in b, which we can then compare to a threshold. See ?"stringdist-metrics" (in the stringdist package) for more information on the method argument.
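A minimal usage sketch of the helper above (the input values are made up for illustration):

stringdist_detect(c("preclinical", "perlcinical", "phase i"), "preclinical")
# expected: TRUE TRUE FALSE — the two transpositions in "perlcinical" cost 2
# under the default "osa" method, which is within thres = 2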
Here I simply followed @shs's suggestion, which seems reasonable.
The second thing I do is tokenize the text before comparing distances, since it makes more sense to look for typos in tokens. tidytext::unnest_tokens is a great function that splits text into words (i.e. tokenizes it):
example_data %>%
tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
filter(stringdist_detect(word, tolower(search_terms)))
## # A tibble: 2 × 4
## project disease_phase startdate word
## <chr> <chr> <chr> <chr>
## 1 A111 Diabetes, Preclinical 01DEC2018 preclinical
## 2 A123 Lipid lowering, Perlcinical 17-OKT-2017 perlcinical
Tokenization has the added advantage of giving you a column that tells you which word was matched, which should make testing different thresholds a lot easier. However, as @shs pointed out, you will get some duplicates if two misspellings are found in the same entry. You can use filter(!duplicated(project)), as in the next part, to get rid of them.
If you do not want to define your own function, you can also follow @Maël's suggestion; spelled out here:
search_terms <- data.frame(word = search_terms)
example_data %>%
tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
fuzzyjoin::stringdist_inner_join(search_terms, by = "word", max_dist = 2) %>%
filter(!duplicated(project))
## # A tibble: 2 × 5
## project disease_phase startdate word.x word.y
## <chr> <chr> <chr> <chr> <chr>
## 1 A111 Diabetes, Preclinical 01DEC2018 preclinical preclinical
## 2 A123 Lipid lowering, Perlcinical 17-OKT-2017 perlcinical preclinical
Benchmark
example_large <- example_data %>%
# upsample for more realistic scenario
sample_n(5000, replace = TRUE)
res <- bench::mark(
stringdist_detect = {
example_large %>%
tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
},
fuzzyjoin = {
example_large %>%
tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
fuzzyjoin::stringdist_inner_join(data.frame(word = search_terms), by = "word", max_dist = 2) %>%
filter(!duplicated(project))
},
agrepl = {
example_large %>%
filter(agrepl(paste(search_terms, collapse = "|"), disease_phase, 2, ignore.case=TRUE, fixed=FALSE))
},
agrepl_reduce = {
example_large[Reduce(\(y, x) y | agrepl(x, example_large$disease_phase, 2,
ignore.case=TRUE), search_terms, FALSE),]
},
check = FALSE
)
summary(res)
## # A tibble: 4 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 stringdist_detect 21.3ms 23.3ms 42.8 NA 13.4
## 2 fuzzyjoin 57.4ms 60.1ms 16.8 NA 13.4
## 3 agrepl 224.7ms 226.4ms 4.33 NA 0
## 4 agrepl_reduce 229ms 229.1ms 4.36 NA 0
summary(res, relative =TRUE)
## # A tibble: 4 × 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stringdist_detect 1 1 9.88 NA Inf
## 2 fuzzyjoin 2.70 2.59 3.88 NA Inf
## 3 agrepl 10.6 9.73 1 NA NaN
## 4 agrepl_reduce 10.8 9.85 1.01 NA NaN
As you can see, stringdist_detect is the fastest, followed by fuzzyjoin (which also uses stringdist under the hood). I also included @GKi's agrepl approach. On smaller datasets agrepl is actually faster, but I assume your real dataset has more than 5 rows. It would not hurt to try these functions on your data and report back.