
fuzzy version of stringr::str_detect for filtering dataframe

I have a database with a free-text field that I want to use to filter a data.frame/tibble. With a lot of work I could perhaps create a list of all possible misspellings of my search terms that currently occur in the data (see below for an example with all the spellings of one term), and then I could simply use stringr::str_detect as in the example code below. However, that is not safe against further misspellings that may appear in the future. If I am willing to accept some limitations / make some assumptions (e.g. about how large the edit distance between misspellings can be, or, for other kinds of differences, that people will not use entirely different terms, etc.), is there a simple solution for doing a fuzzy version of str_detect?

As far as I can tell, the obvious packages like stringdist do not seem to have a function that does this directly. I suppose I could write my own function that applies something like stringdist::afind or stringdist::amatch to each element of a vector and then post-processes the results to finally return a vector of TRUE/FALSE booleans, but I wonder whether this function doesn't already exist somewhere (and isn't implemented more efficiently than I would manage).
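
Something along these lines is what I have in mind; a minimal sketch (fuzzy_detect is a made-up name, and leaning on stringdist::afind, which slides a window over each string and reports the best windowed distance per pattern, is just one assumed approach):

library(stringdist)

# Hypothetical wrapper: afind() returns, per string and pattern, the distance
# of the best-matching window; post-process that into a TRUE/FALSE vector
fuzzy_detect <- function(string, patterns, max_dist = 2, method = "dl") {
  d <- afind(tolower(string), tolower(patterns), method = method)$distance
  # d has one row per element of `string` and one column per pattern
  apply(d, 1, min) <= max_dist
}

# usage: example_data %>% filter(fuzzy_detect(disease_phase, "preclinical"))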

Here is an example that illustrates how, with str_detect, I may miss a row I would want:

library(tidyverse)

search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial", 
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical", 
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical", 
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical", 
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical", 
                                      "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")) 

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))

The Damerau–Levenshtein distance is a good choice for measuring string distance when it comes to typos. In the code below I split disease_phase and check whether any of the substrings closely matches "preclinical":
library(tidyverse)
library(stringdist)

example_data |> 
  filter(str_split(disease_phase, "\\W+") |> 
           map_lgl(\(x) x |> 
                 stringdist("preclinical", method = "dl") |> 
                 (`<=`)(4) |> # Threshold for distance
                 any()
               )
         )
#> # A tibble: 2 × 3
#>   project disease_phase               startdate  
#>   <chr>   <chr>                       <chr>      
#> 1 A111    Diabetes, Preclinical       01DEC2018  
#> 2 A123    Lipid lowering, Perlcinical 17-OKT-2017

I chose a fairly conservative threshold distance of <= 4 because, as you can see below, your typo examples all lie below that value. You may have to do some testing to arrive at a good threshold.

stringdist(search_terms, "preclinical")
#>  [1] 0 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2

Created on 2022-04-23 with the reprex package (v2.0.1)
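
One caveat before settling on a threshold: related but distinct words can also slip under a generous cutoff. A quick check of my own (not part of the reprex above):

stringdist("clinical", "preclinical", method = "dl")
#> [1] 3

So with <= 4 as the threshold, a field containing plain "clinical" would be kept as well.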

Edit:

As I noted in my comment on JBGruber's answer, going long rather than nested gives a significant performance benefit. So it is better to do it like this:

example_large |>
  tidytext::unnest_tokens(word, disease_phase, drop = FALSE) |>
  mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |>
  filter(str_dist < 4) |>
  group_by(project, disease_phase) |>
  slice(which.min(str_dist))

"preclinical" 在同一字符串中出现两次时,最后两行是为了避免潜在的重复,样本数据中没有出现这种情况,但在大型人工生成的数据集中并非不可能。

You can use agrep/agrepl for approximate string matching (fuzzy matching), which comes with base R.

example_data[agrep(paste(search_terms, collapse = "|"),
  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

Or use Reduce over the individual search terms instead of combining them with | in the regex:

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
  ignore.case=TRUE), search_terms, FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

An alternative could be adist, also in base R, which computes a distance matrix and is therefore probably not recommended for larger vectors, as the matrix can get large. Here I again chose to allow up to 2 non-matching characters:

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial=TRUE) < 3) > 0,]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

It might be more efficient to compare only single, unique words, so you can also use strsplit, again in base R, to split disease_phase into words:

#Split each disease_phase into words ...
. <- strsplit(example_data$disease_phase, "[ ,;]+")
#... and map each unique lower-cased word to the rows it occurs in
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
#Fuzzy-match the search terms against the unique words only, skipping words
#that already matched, then look up the rows of the matching words
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

Some simpler examples of using agrepl:

#Allow a difference of 1 character for a match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1]  TRUE  TRUE FALSE

#Allow a difference of 2 characters for a match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\bpreclinical\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1]  TRUE FALSE

The allowed difference can be set with max.distance:

max.distance: Maximum distance allowed for a match.  Expressed either
          as integer, or as a fraction of the _pattern_ length times
          the maximal transformation cost (will be replaced by the
          smallest integer not less than the corresponding fraction),
          or a list with possible components

          ‘cost’: maximum number/fraction of match cost (generalized
              Levenshtein distance)

          ‘all’: maximal number/fraction of _all_ transformations
              (insertions, deletions and substitutions)

          ‘insertions’: maximum number/fraction of insertions

          ‘deletions’: maximum number/fraction of deletions

          ‘substitutions’: maximum number/fraction of substitutions
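
For instance, the list form can be used like this (example values of my own, following the documentation above):

#Allow up to 2 substitutions, but no insertions or deletions
agrepl("preclinical", "prexlinical",
  max.distance = list(substitutions = 2, insertions = 0, deletions = 0))
#[1] TRUE

#As a fraction of the pattern length: ceiling(0.1 * 11) = 2 changes allowed
agrepl("preclinical", "prelcinical", max.distance = 0.1)
#[1] TRUE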

And a benchmark based on @JBGruber's:

system.time({  #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
#       User      System verstrichen 
#      1.008       0.040       1.046 

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
   example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
     tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

A lot of time can also be saved if there is only one, correctly written, variant of the word of interest in search_terms.

I think the most efficient/fastest way would be the following:

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

stringdist::stringdistmatrix computes the distance matrix between all values in a and b. I had never heard of Rfast::rowMins, but a bit of googling told me it is the fastest way to find the minimum value in each row of a matrix (apply(x, 1, min) would accomplish the same). That is all we want: the minimum value, as it tells us the smallest distance between a word in a and the words in b. We can compare that value against a threshold. See ?stringdist::stringdist-metrics for more information on the method argument; I simply followed @shs' suggestion here, which seems reasonable.
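
As a small illustration of what happens inside stringdist_detect (toy strings of my own):

# one row per word in `a`, one column per search term in `b`
m <- stringdist::stringdistmatrix(c("perlcinical", "phase"), "preclinical",
                                  method = "osa")
# row-wise minimum distance, compared against the threshold
Rfast::rowMins(m, value = TRUE) <= 2
## [1]  TRUE FALSE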

The second thing I do is tokenize the text before comparing distances, since it makes much more sense to look for typos in tokens. tidytext::unnest_tokens is a nice function that splits text into words (i.e., tokenizes it):

example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  filter(stringdist_detect(word, tolower(search_terms)))
## # A tibble: 2 × 4
##   project disease_phase               startdate   word       
##   <chr>   <chr>                       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical

Tokenization has the added advantage of giving you a column that tells you which word was matched. This should make testing different thresholds much easier. However, as @shs suggested, you get some duplication if two misspellings are found in the same field. You can use filter(!duplicated(project)) to eliminate the duplicates, as done in the next part.

If you don't want to define your own function, you can also follow the suggestion from @Maël's answer; here it is spelled out:

search_terms_df <- data.frame(word = search_terms)
example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  fuzzyjoin::stringdist_inner_join(search_terms_df, by = "word", max_dist = 2) %>% 
  filter(!duplicated(project))
## # A tibble: 2 × 5
##   project disease_phase               startdate   word.x      word.y     
##   <chr>   <chr>                       <chr>       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical preclinical

Benchmark

example_large <- example_data %>%
      # upsample for more realistic scenario
      sample_n(5000, replace = TRUE)

res <- bench::mark(
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  fuzzyjoin = {
    example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      fuzzyjoin::stringdist_inner_join(data.frame(word = search_terms), by = "word", max_dist = 2) %>% 
      filter(!duplicated(project))
  },
  agrepl = {
    example_large %>% 
      filter(agrepl(paste(search_terms, collapse = "|"), disease_phase, 2, ignore.case=TRUE, fixed=FALSE))
  },
  agrepl_reduce = {
    example_large[Reduce(\(y, x) y | agrepl(x, example_large$disease_phase, 2,
                                           ignore.case=TRUE), search_terms, FALSE),]
  },
  check = FALSE
)
summary(res)
## # A tibble: 4 × 6
##   expression             min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 stringdist_detect   21.3ms   23.3ms     42.8         NA     13.4
## 2 fuzzyjoin           57.4ms   60.1ms     16.8         NA     13.4
## 3 agrepl             224.7ms  226.4ms      4.33        NA      0  
## 4 agrepl_reduce        229ms  229.1ms      4.36        NA      0
summary(res, relative =TRUE)
## # A tibble: 4 × 6
##   expression          min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 stringdist_detect  1      1         9.88        NA      Inf
## 2 fuzzyjoin          2.70   2.59      3.88        NA      Inf
## 3 agrepl            10.6    9.73      1           NA      NaN
## 4 agrepl_reduce     10.8    9.85      1.01        NA      NaN

As you can see, stringdist_detect is the fastest, followed by fuzzyjoin (which also uses stringdist under the hood). I also included @GKi's agrepl approach. On smaller datasets agrepl was actually faster, but I assume your real dataset has more than 5 rows. It would not hurt to try these functions on your data and report back.