Cross referencing 10k+ file names with partial-name list, to isolate matching files

Question

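I'm still pretty new to R (a biologist learning the ropes) and hoping someone can help. I have a folder containing more than 10k files, each named "abc_Species name.filetype", e.g. "abc_Panthera onca.filetype", "abc_Boa constrictor.filetype", etc. From that list I need to isolate the files corresponding to 1000 specific species. So far I've got the basics done, but I'm failing at the cross-matching.

In trying different solutions I've loaded various packages, but I can't see the wood for the trees and am no longer sure which ones are required! Here's what I have so far: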

library(filesstrings)
library(purrr)
library(stringr)
library(readr)
options(max.print=999999)

# Import file containing list of species of interest (among several columns, 
# the one titled "Species" is the column of my 1000 species-of-interest - 
# in that column are the list of Species names, most of which should be 
# found among the titles of the 10k+ files) 
Taxonomy <- read_csv("Taxonomy.csv")

#Create list of all of the 10k+ species files
Allspecieslist <- list.files(path = "C:/etc")

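A few of the solutions I've tried and failed with (perhaps just because I need more of a walk-through) required me to copy and paste the whole list, which I can't do without first creating a separate file: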

capture.output(Allspecieslist, file = "Allspecieslist.txt")

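Could anyone help me with code that finds, within Allspecieslist, all the file names containing a species mentioned in the Taxonomy (i.e. species-of-interest) list? Once I have the list of matching files, I'll create copies of them in a separate folder.

Many thanks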

Answer 1

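I hope I've understood your question correctly. I'll use fruit as an example.

Exact match

Suppose your folder contains 10 fruit files and you want to extract the file names for orange, dragon fruit, and melon.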

# this is your 10k files
Allspecieslist <- c("abc_apple.filetype", "abc_orange.filetype", 
                    "abc_grape.filetype", "abc_melon.filetype", 
                    "abc_mango.filetype", "abc_pear.filetype", 
                    "abc_watermelon.filetype", "abc_dragon fruit.filetype", 
                    "abc_kiwi fruit.filetype", "abc_durian.filetype")

# let's pretend these are your target species
Taxonomy <- data.frame(Fruit = c("orange", "dragon fruit", "melon"))

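# str_extract() comes from stringr
library(stringr)

# The regex keeps whatever lies between the first "_" and the last ".",
# i.e. the species name; the backslash must be doubled inside an R string
output_list <- Allspecieslist[str_extract(Allspecieslist, "(?<=_).*(?=\\.)") %in% Taxonomy$Fruit]
output_list

[1] "abc_orange.filetype"       "abc_melon.filetype"       
[3] "abc_dragon fruit.filetype"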
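
If you would rather not depend on stringr, the same exact match can probably be done in base R. The following is only a sketch: the three file names and the species_of_interest vector are made-up stand-ins for your list.files() output and your Taxonomy$Species column, and it assumes every file really does start with "abc_" and end in a single extension.

```r
# Hypothetical stand-ins for list.files() output and the Taxonomy$Species column
Allspecieslist <- c("abc_Panthera onca.filetype",
                    "abc_Boa constrictor.filetype",
                    "abc_Felis catus.filetype")
species_of_interest <- c("Panthera onca", "Boa constrictor", "Canis lupus")

# Strip the extension, then the "abc_" prefix, leaving only the species name
species_in_files <- sub("^abc_", "", tools::file_path_sans_ext(Allspecieslist))

# Files whose species name appears in the list of interest
matched_files <- Allspecieslist[species_in_files %in% species_of_interest]

# Species of interest for which no file was found
missing_species <- setdiff(species_of_interest, species_in_files)

print(matched_files)
print(missing_species)
```

Since you mention that only most of your 1000 species should have a file, missing_species may be a handy check on which ones did not turn up.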

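Constructing the list directly

If you are sure that your target species are among your 10k files, you don't actually need to match anything; you can simply build a list of names that follows the file-name pattern.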

prefix <- "abc_"
suffix <- ".filetype"

paste0(prefix, Taxonomy$Fruit, suffix)
[1] "abc_orange.filetype"       "abc_dragon fruit.filetype"
[3] "abc_melon.filetype"
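
To then copy the files into a separate folder, as you describe, something along these lines might work. It is a sketch under assumptions: src_dir and dest_dir are hypothetical folders created under tempdir() with three dummy files so that the example runs on its own, and you would swap in your real source folder and destination. Checking file.exists() first guards against constructed names that have no file on disk.

```r
# Hypothetical folders under tempdir() so the sketch is self-contained
src_dir  <- file.path(tempdir(), "all_species")
dest_dir <- file.path(tempdir(), "species_of_interest")
dir.create(src_dir, showWarnings = FALSE)
dir.create(dest_dir, showWarnings = FALSE)

# Dummy files standing in for the 10k+ real ones
file.create(file.path(src_dir, c("abc_orange.filetype",
                                 "abc_melon.filetype",
                                 "abc_kiwi fruit.filetype")))

Taxonomy <- data.frame(Fruit = c("orange", "dragon fruit", "melon"))

# Build the expected file names, then keep only those that exist on disk
expected <- paste0("abc_", Taxonomy$Fruit, ".filetype")
present  <- expected[file.exists(file.path(src_dir, expected))]

# Copy the files that were found; file.copy() returns TRUE/FALSE per file
copied <- file.copy(file.path(src_dir, present), dest_dir)

# Names that were expected but have no file in the source folder
absent <- setdiff(expected, present)

print(present)
print(absent)
```

The same file.copy() call should work with output_list from the exact-match approach in place of present.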
