Putting a skip in a function that groups files together into a nested list

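I have a folder of 325 spreadsheets containing election results for different constituencies in Moscow. I am trying to group together the files that belong to the same municipal district (a higher level of aggregation) so that I can total the election results at that level. (See the dput output of the file names below.)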

I created a function that correctly matches files by extracting the part of the string before the constituency number:

library(stringr)

mf.vote.matcher <- function(file, filelist){

  # match everything in the file name before the word "vote" (i.e. the mf name)
  match_string <- str_extract(file, pattern = ".*(?=vote)")
  matched_files <- grep(filelist, pattern = match_string)

  # wrap the matched file names in a list and return it
  list(filelist[matched_files])

}

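However, when applied to the full list of files with lapply, it goes through every file and creates a list with many redundant elements. For example, the first municipal district has 3 constituencies, so the function's output repeats those 3 file names 3 times.

Is there any way to force the function or lapply to "skip" ahead to the files of the next municipal district, based on the length of the list it returned?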

Here is a sample of the file names:

c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
"./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
"./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
"./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
"./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
"./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
"./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
"./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
"./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
"./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
"./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")

Alternatively, you can iterate over the unique districts.

For example:

library(stringr)

dat <- c("./Vote/Академический vote 1.xls", "./Vote/Академический vote 2.xls", 
             "./Vote/Академический vote 3.xls", "./Vote/Алексеевский в городе Москве vote 1.xls", 
             "./Vote/Алексеевский в городе Москве vote 2.xls", "./Vote/Алтуфьевский vote 1.xls", 
             "./Vote/Алтуфьевский vote 2.xls", "./Vote/Алтуфьевский vote 3.xls", 
             "./Vote/Арбат vote 1.xls", "./Vote/Арбат vote 2.xls", "./Vote/Аэропорт vote 1.xls", 
             "./Vote/Аэропорт vote 2.xls", "./Vote/Аэропорт vote 3.xls", "./Vote/Бабушкинский vote 1.xls", 
             "./Vote/Бабушкинский vote 2.xls", "./Vote/Басманный vote 1.xls", 
             "./Vote/Басманный vote 2.xls", "./Vote/Басманный vote 3.xls", 
             "./Vote/Беговой vote 1.xls", "./Vote/Беговой vote 2.xls", "./Vote/Бескудниковский vote 1.xls", 
             "./Vote/Бескудниковский vote 2.xls", "./Vote/Бибирево vote 1.xls", 
             "./Vote/Бибирево vote 2.xls", "./Vote/Бибирево vote 3.xls")


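# Extract the district prefix (everything before "vote") for each file,
# then loop over the unique prefixes only.
districts <- unique(str_extract(dat, ".*(?=vote)"))
out <- lapply(districts, function(x) {
  dat[grepl(x, dat, fixed = TRUE)]  # fixed = TRUE: treat the prefix as literal text
})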

> out
[[1]]
[1] "./Vote/Академический vote 1.xls" "./Vote/Академический vote 2.xls" "./Vote/Академический vote 3.xls"

[[2]]
[1] "./Vote/Алексеевский в городе Москве vote 1.xls" "./Vote/Алексеевский в городе Москве vote 2.xls" 

...etc
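If you would rather keep a per-file matcher like the one in the question, another option is to let lapply visit every file and drop the repeated groups afterwards with unique(). The following is only a sketch under some assumptions: `match_one` is a hypothetical stand-in for mf.vote.matcher that uses base R (regmatches/regexpr) in place of str_extract, and `files` is a shortened copy of the file names.

```r
# Shortened sample of the file names from the question
files <- c("./Vote/Академический vote 1.xls",
           "./Vote/Академический vote 2.xls",
           "./Vote/Академический vote 3.xls",
           "./Vote/Арбат vote 1.xls",
           "./Vote/Арбат vote 2.xls")

# Hypothetical stand-in for mf.vote.matcher, written with base R only
match_one <- function(file, filelist) {
  # everything in the file name before the word "vote"
  prefix <- regmatches(file, regexpr(".*(?=vote)", file, perl = TRUE))
  # literal (non-regex) match of that prefix against all file names
  list(filelist[grep(prefix, filelist, fixed = TRUE)])
}

# One result per file: each district's group is repeated once per file
all_groups <- lapply(files, match_one, filelist = files)

# Keep only the first occurrence of each identical group
groups <- unique(all_groups)
```

This does not make lapply skip anything; it still does the redundant matching and simply removes the duplicates at the end, which is usually fine for a few hundred files.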

Another way to group the values:

region <- gsub('.*/Vote/(.+) vote .*', '\\1', list, perl=TRUE)
groups <- split(list, region)

("list" is a vector containing the file names)
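For reference, here is the split() idea as a small self-contained sketch. It assumes the vector is called `files` (a hypothetical name, chosen so that base::list is not masked) and uses a shortened copy of the file names. Note that inside an R string the backreference has to be written as "\\1".

```r
# Shortened sample of the file names from the question
files <- c("./Vote/Академический vote 1.xls",
           "./Vote/Академический vote 2.xls",
           "./Vote/Академический vote 3.xls",
           "./Vote/Алексеевский в городе Москве vote 1.xls",
           "./Vote/Алексеевский в городе Москве vote 2.xls")

# Capture the district name between "/Vote/" and " vote "
region <- sub(".*/Vote/(.+) vote .*", "\\1", files)

# One list element per district, named by the district
groups <- split(files, region)

# Number of constituency files in each district
n_files <- lengths(groups)
```

Because the result is a named list, a single district can be looked up directly, for example groups[["Академический"]], and the whole list can be passed to lapply to aggregate each district in turn.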