R - fast way to find all vector elements that contain all search terms
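The same question was answered here, but the suggested solution takes too long.
I have 73,360 observations containing sentences. I want TRUE returned for every sentence that contains ALL of the search strings.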
sentences <- c("blue green red",
"blue green yellow",
"green red yellow ")
search_terms <- c("blue","red")
pattern <- paste0("(?=.*", search_terms,")", collapse="")
grepl(pattern, sentences, perl = TRUE)
-output
[1] TRUE FALSE FALSE
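For reference, it may help to print the pattern that the code above builds, since that is where the cost comes from. This is only a sketch that reuses the same `sentences` and `search_terms`; nothing is added beyond showing the intermediate value.

```r
sentences <- c("blue green red",
               "blue green yellow",
               "green red yellow ")
search_terms <- c("blue", "red")

# Each term becomes one lookahead; collapse = "" glues them together.
pattern <- paste0("(?=.*", search_terms, ")", collapse = "")
print(pattern)
# [1] "(?=.*blue)(?=.*red)"

# Every lookahead scans ahead with .*, and the regex engine may retry
# from later start positions when one of the terms is missing.
result <- grepl(pattern, sentences, perl = TRUE)
print(result)
# [1]  TRUE FALSE FALSE
```

Because the terms are pasted into a regular expression, any term containing characters such as `[` or `.` would be interpreted as regex syntax unless it is escaped.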
This gives the right result, but it takes a very, very long time. Is there a faster way? I tried str_detect and it was just as slow.
By the way, the "sentences" contain special characters such as [],.- but no characters such as ñ.
UPDATED: thanks to @onyambu's input, here are my benchmark results for the suggested methods.
Unit: milliseconds
expr min lq mean median uq max neval
OP_solution() 7033.7550 7152.0689 7277.8248 7251.8419 7391.8664 7690.964 100
map_str_detect() 2239.8715 2292.1271 2357.7432 2348.9975 2397.1758 2774.349 100
unlist_lapply_fixed() 308.1492 331.9948 345.6262 339.9935 348.9907 586.169 100
Reduce_lapply wins! Thanks @onyambu
Unit: milliseconds
expr min lq mean median uq max neval
Reduce_lapply() 49.02941 53.61291 55.96418 55.31494 56.76109 80.64735 100
unlist_lapply_fixed() 318.25518 335.58883 362.03831 346.71509 357.97142 566.95738 100
You could try a mix of purrr and stringr functions to solve this:
library(tidyverse)
purrr::map_lgl(
.x = sentences,
.f = ~ all(stringr::str_detect(.x, search_terms))
)
EDIT:
Another option is to loop over the search terms instead of looping over the sentences:
Reduce("&", lapply(search_terms, grepl, sentences, fixed = TRUE))
[1] TRUE FALSE FALSE
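A sketch of the same idea wrapped in a function, to show why `fixed = TRUE` suits the special characters mentioned in the question. The name `contains_all` and the sample data are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: TRUE where a sentence contains every term as a
# literal substring. One vectorised grepl() call runs per search term,
# and Reduce() combines the resulting logical vectors with `&`.
contains_all <- function(sentences, terms) {
  hits <- lapply(terms, grepl, x = sentences, fixed = TRUE)
  Reduce(`&`, hits)
}

# Made-up sentences with brackets, commas, dots and hyphens.
sample_sentences <- c("cost [usd], net", "cost usd net", "a.b-c [usd]")
sample_terms <- c("[usd]", "cost")

res <- contains_all(sample_sentences, sample_terms)
print(res)
# [1]  TRUE FALSE FALSE
```

With `fixed = TRUE` the term `"[usd]"` is matched literally; as a regular expression it would be read as a character class instead. The work is a handful of vectorised passes (one per term) rather than one function call per sentence, which is a plausible reason for the speed-up.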
Benchmark:
Unit: milliseconds
expr min lq mean median uq max neval
OP_solution() 80.6365 81.61575 85.76427 83.20265 87.32975 163.0302 100
map_str_detect() 546.4681 563.08570 596.26190 571.52185 603.03980 1383.7969 100
unlist_lapply_fixed() 61.8119 67.49450 71.41485 69.56290 73.77240 104.8399 100
Reduce_lapply() 3.0604 3.11205 3.406012 3.14535 3.43130 6.3526 100
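Note that this is very fast: a median of about 3 ms, against about 83 ms for the original solution.
OLD POST:
Use the all function as shown below: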
unlist(lapply(strsplit(sentences, " ", fixed = TRUE), \(x)all(search_terms %in% x)))
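One caveat worth checking on real data: the `strsplit()` approach compares whole space-separated tokens, while the `grepl()`-based approaches match substrings, so the two can disagree. A small sketch with made-up sentences (not taken from the question's data):

```r
demo_sentences <- c("blue red", "bluegreen red", "blue, red")
demo_terms <- c("blue", "red")

# Token-based: every term must equal one of the space-separated tokens.
by_token <- unlist(lapply(strsplit(demo_sentences, " ", fixed = TRUE),
                          function(x) all(demo_terms %in% x)))

# Substring-based: every term must occur somewhere in the sentence.
by_substring <- Reduce(`&`, lapply(demo_terms, grepl, demo_sentences, fixed = TRUE))

print(by_token)
# [1]  TRUE FALSE FALSE
print(by_substring)
# [1] TRUE TRUE TRUE
```

In the second sentence "blue" only appears inside "bluegreen", and in the third the token is "blue," with a trailing comma, so the token-based version rejects both. Which behaviour is wanted depends on the data.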
Benchmark:
OP_solution <- function(){
  pattern <- paste0("(?=.*", search_terms,")", collapse="")
  grepl(pattern, sentences, perl = TRUE)
}
map_str_detect <- function(){
  purrr::map_lgl(
    .x = sentences,
    .f = ~ all(stringr::str_detect(.x, search_terms))
  )
}
unlist_lapply_fixed <- function() unlist(lapply(strsplit(sentences, " ", fixed = TRUE), \(x)all(search_terms %in% x)))
# Repeat the three sample sentences to get 30,000 elements.
sentences <- rep(sentences, 10000)
microbenchmark::microbenchmark( OP_solution(),map_str_detect(),
                                unlist_lapply_fixed(), check = 'equal')
Unit: milliseconds
expr min lq mean median uq max neval
OP_solution() 80.5368 81.40265 85.14451 82.73985 86.41345 118.7052 100
map_str_detect() 542.3555 553.84080 587.15748 566.66570 607.77130 782.5189 100
unlist_lapply_fixed() 60.4955 66.94420 71.94195 69.30135 72.16735 113.6567 100
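The benchmark code above does not define `Reduce_lapply()`, although it appears in the later timing table. A sketch of how it might be defined and checked, assuming it simply wraps the `Reduce()` one-liner from the edit; the `times = 10` setting is an arbitrary choice to keep the run short.

```r
sentences <- rep(c("blue green red",
                   "blue green yellow",
                   "green red yellow "), 10000)
search_terms <- c("blue", "red")

# Assumed definition, mirroring the one-liner from the edit.
Reduce_lapply <- function() {
  Reduce("&", lapply(search_terms, grepl, sentences, fixed = TRUE))
}

OP_solution <- function() {
  pattern <- paste0("(?=.*", search_terms, ")", collapse = "")
  grepl(pattern, sentences, perl = TRUE)
}

# Confirm both give the same answer before timing anything.
same_result <- identical(Reduce_lapply(), OP_solution())
print(same_result)

# Timings depend on machine and data, so no output is shown here.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(OP_solution(), Reduce_lapply(),
                                       check = "equal", times = 10))
}
```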