Filter a list for elements that contain ALL of multiple partial strings in R
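I am trying to filter a list of file names based on a set of keywords that the user selects in a Shiny app. The final list should contain only the files whose names include all of the partial keywords.
So far I have been trying this code: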
sapply(filenames, grepl, keywords)
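But how do I get from that output to only the elements for which every keyword is TRUE?
I also tried this solution, but
all(sapply(filenames, grepl, keywords))
is of course FALSE for my list. I could write an lapply function that applies sapply(....) to each element, but maybe there is a more efficient way to do it all at once?
I also looked at the options for grep and grepl, but they only accept OR arguments and there does not seem to be an AND.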
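One likely reason the attempt above comes back all FALSE: sapply() passes each element of its first argument as the first argument of the function, and the first argument of grepl() is the pattern. So sapply(filenames, grepl, keywords) uses each file name as the pattern and searches inside the keywords. A minimal sketch, using a shortened, made-up file list (the names `files`, `swapped` and `hits` are only for illustration):

```r
# Shortened stand-in for the real file list (illustrative only)
files <- c("M1bI Euk 2017-06-19.csv",
           "M1bI Syn 2016-06-18.csv",
           "M1bI Syn 2017-06-20.csv")
keywords <- c("Syn", "2017")

# Original call: each file name becomes the *pattern*, searched for in the keywords
swapped <- sapply(files, grepl, keywords)
any(swapped)  # no keyword contains a whole file name

# Iterating over the keywords instead gives one logical column per keyword
hits <- sapply(keywords, grepl, x = files)
hits
```

Each row of `hits` corresponds to one file, so a file qualifies only when its whole row is TRUE.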
Example keywords:
keywords <- c("Syn", "2017")
Example list:
filenames <-
c("AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2016-06-18 13u22.csv", "AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2016-06-19 13u26.csv",
"AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2017-06-19 13u27.csv", "AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2017-06-20 13u11.csv",
"AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2018-06-21 13u12.csv", "AdditionalListMode_M1bI Euk SWS 60 20 90 90 80 2018-06-22 16u00.csv",
"AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2016-06-18 13u25.csv", "AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2016-06-19 13u29.csv",
"AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2017-06-20 13u14.csv", "AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2017-06-21 13u15.csv",
"AdditionalListMode_M1bI Large Euk SWS 50 20 90 90 80 2018-06-22 16u03.csv", "AdditionalListMode_M1bI Syn 60 90 90 110 2016-06-18 13u31.csv",
"AdditionalListMode_M1bI Syn 60 90 90 110 2016-06-19 13u35.csv", "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv",
"AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv", "AdditionalListMode_M1bI Syn 60 90 90 110 2018-06-22 16u09.csv")
Expected result:
"AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
"AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
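Sorry for posting what may be a slight duplicate, but after a long search on SO and Google I could not find a real solution.
Edit, results:
I used a dataset of 359 file names to get microbenchmark results for all of the working answers (including the answers that are sensitive to keyword order):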
Unit: microseconds
expr min lq mean median uq max neval
filesshort <- filenames[apply(sapply(keywords, function(x) grepl(x, filenames)), 1, function(y) sum(y) == length(y))] 1220.588 1318.093 1691.7377 1366.2530 1635.477 5718.049 50
filesshort <- filenames[Reduce("&", lapply(keywords, function(x) grepl(x, filenames)))] 532.922 568.055 640.7301 591.5435 637.137 1971.415 50
filesshort <- grep(paste(keywords, collapse = ".*"), filenames, value = T) 302.779 331.991 379.9144 343.4390 380.941 790.303 50
filesshort <- regmatches(filenames, regexpr(paste(keywords, collapse = ".*"), filenames)) 2244.587 2310.905 2668.2153 2456.9655 2708.820 5758.314 50
filesshort <- unlist(regmatches(filenames, gregexpr(paste(keywords, collapse = ".*"), filenames))) 3768.742 3985.463 5491.8536 4654.5750 5322.109 42538.964 50
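Expression 3, which uses grep, is by far the fastest, but it is also sensitive to keyword order.
Taking both speed and tolerance of keyword order into account, expression 2 with Reduce is the winner compared with the other 4 answers.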
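The order sensitivity mentioned above can be seen in a small sketch. The file names here are made up for illustration (the second one puts the date before "Syn"), and `ordered` / `unordered` are hypothetical variable names:

```r
# Made-up file names; the second one has the year before "Syn"
files <- c("M1bI Syn 2017-06-20.csv",
           "2017-06-20 M1bI Syn.csv",
           "M1bI Euk 2017-06-19.csv")
keywords <- c("Syn", "2017")

# Single pattern "Syn.*2017": only matches when "Syn" comes before "2017"
ordered <- grep(paste(keywords, collapse = ".*"), files, value = TRUE)

# One grepl() per keyword, combined with "&": keyword order does not matter
unordered <- files[Reduce("&", lapply(keywords, function(k) grepl(k, files)))]

ordered
unordered
```

With these inputs the collapsed pattern finds only the first file, while the Reduce version finds the first two.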
filenames[Reduce("&", lapply(keywords, function(x) grepl(x, filenames)))]
#[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
#[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
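For use behind a user-driven keyword selection, the Reduce approach could be wrapped in a small function. This is only a sketch: `filter_all_keywords` is a hypothetical helper, not part of the original answers. It assumes that keywords are meant as literal text (hence `fixed = TRUE`) and that an empty selection should return every file; both assumptions may not fit every app.

```r
# Hypothetical helper, not taken from the original answers
filter_all_keywords <- function(files, keywords, fixed = TRUE) {
  # Assumption for this sketch: no keywords selected means nothing is filtered out
  if (length(keywords) == 0) return(files)
  # One logical vector per keyword; fixed = TRUE treats keywords as literal text
  matches <- lapply(keywords, function(k) grepl(k, files, fixed = fixed))
  # Elementwise AND across all keywords
  files[Reduce(`&`, matches)]
}

# Shortened, made-up file names for illustration
files <- c("M1bI Syn 2017-06-20.csv",
           "M1bI Syn 2016-06-18.csv",
           "M1bI Euk 2017-06-19.csv")

filter_all_keywords(files, c("Syn", "2017"))
filter_all_keywords(files, character(0))
```

Without the length check, Reduce() on an empty list returns NULL and the subset comes back empty, which may or may not be the behaviour you want.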
filenames[apply(sapply(keywords, function(x) grepl(x, filenames)), 1, function(y) sum(y) == length(y))]
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
keywords <- c("Syn.*2017")
filenames[grep(keywords,filenames)]
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
grep("Syn.*?2017",filenames,value = T)
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
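If a single regular expression is preferred but keyword order should not matter, one option is to use lookahead assertions, which in R require perl = TRUE. A minimal sketch, assuming the keywords contain no regex metacharacters; the file names are made up and `pattern` / `both` are illustrative names:

```r
# Made-up file names; the second one has the year before "Syn"
files <- c("M1bI Syn 2017-06-20.csv",
           "2017-06-20 M1bI Syn.csv",
           "M1bI Syn 2016-06-18.csv")
keywords <- c("Syn", "2017")

# Builds "^(?=.*Syn)(?=.*2017)": every lookahead must succeed from the start
pattern <- paste0("^", paste0("(?=.*", keywords, ")", collapse = ""))

# Lookaheads are a PCRE feature, so perl = TRUE is needed
both <- grep(pattern, files, value = TRUE, perl = TRUE)
both
```

Here both files containing "Syn" and "2017" are returned, regardless of which keyword appears first.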
regmatches(filenames,regexpr("(.*Syn).*?2017(.*)",filenames))
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
unlist(regmatches(filenames,gregexpr("(.*Syn).*?2017(.*)",filenames)))
[1] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-20 13u20.csv"
[2] "AdditionalListMode_M1bI Syn 60 90 90 110 2017-06-21 13u21.csv"
You can use whichever of these suits the job at hand.