Dealing with zip files in a targets workflow

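I'm trying to set up a workflow that involves downloading a zip file, extracting its contents, and applying a function to each of the extracted files.

I'm running into a few issues: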

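  1. How can I set up an empty file system reproducibly? That is, I'd like to be able to create a system of empty directories into which files will later be downloaded. Ideally, I'd like to do something like tar_target(my_dir, fs::dir_create("data"), format = "file"), but I know from the docs that empty directories cannot be used with format = "file". I know I could just call dir_create in every place where I need the directory, but that seems clumsy.

  2. In the reprex below, I'd like to operate on each file individually using pattern = map(x). As the error suggests, I'd need to specify a pattern for the parent target, since it has format = "file". You can see that if I did specify a pattern for the parent target, I would then need to specify one for its parent target as well. As far as I know, a pattern cannot be set for a target that has no parents (but I've been wrong many times before).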

I have the feeling I'm going about this all wrong. Thank you for your time.

library(targets)
tar_script({
    tar_option_set(packages = c("tidyverse", "fs"))
    download_file <- function(url, dest) {
        download.file(url, dest)
        dest
    }
    do_stuff <- function(file_path) {
        fs::file_copy(file_path, file_path, overwrite = TRUE)
    }
    list(
      tar_target(downloaded_zip, 
                 download_file("https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip", 
                               path(dir_create("data"), "file", ext = "zip")), 
                 format = "file"), 
 
      tar_target(extracted_files, 
                 unzip(downloaded_zip, exdir = dir_create("data")), 
                 format = "file"), 

      tar_target(stuff_done, 
                 do_stuff(extracted_files), 
                 pattern = map(extracted_files), format = "file", 
                 iteration = "list"))
})
tar_make()
#> * start target downloaded_zip
#> trying URL 'https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip'
#> Content type 'application/zip' length 2036861 bytes (1.9 MB)
#> ==================================================
#> downloaded 1.9 MB
#> 
#> * built target downloaded_zip
#> * start target extracted_files
#> * built target extracted_files
#> * end pipeline
#> Error : Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Error: callr subprocess failed: Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

Created on 2021-12-08 by the reprex package (v2.0.1)

Original answer

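Here's an idea: you could track the URL with format = "url" and then make the URL a dependency of all the file branches. Below, all branches of files should rerun when the upstream online data changes. That's fine, because all that does is re-hash the files. But not all branches of stuff_done need to run if only some of those files actually changed.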

Edit

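On second thought, we probably need to hash all the local files in bulk. It's not the most efficient approach, but it gets the job done. targets wants you to use its own built-in storage system instead of external files, so if you can read the data in and return it in a non-file format, dynamic branching will be easier.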

# _targets.R file
library(targets)
tar_option_set(packages = c("tidyverse", "fs"))
download_file <- function(url, dest) {
  download.file(url, dest)
  dest
}
do_stuff <- function(file_path) {
  file.info(file_path)
}
download_and_unzip <- function(url) {
  downloaded_zip <- tempfile()
  download_file(url, downloaded_zip)
  unzip(downloaded_zip, exdir = dir_create("data"))
}
list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  tar_target(
    files_bulk,
    download_and_unzip(url),
    format = "file"
  ),
  tar_target(file_names, files_bulk), # not a format = "file" target
  tar_target(
    files, {
      files_bulk # Re-hash all the files separately if any file changes.
      file_names
    },
    pattern = map(file_names),
    format = "file"
  ),
  tar_target(stuff_done, do_stuff(files), pattern = map(files))
)
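
To illustrate the last point in the edit (returning the data in a non-file format), here is a rough sketch of what that alternative could look like. The helpers read_files_raw and download_and_read are hypothetical names made up for this example, and do_stuff is only a stand-in for the real per-file work. It assumes the extracted files are small enough to hold in memory as raw vectors; I have not run it against the real URL.

```r
# _targets.R file (sketch of the non-file alternative, not tested end to end)
library(targets)

# Hypothetical helper: read each file into a raw vector and return
# a list named by file name.
read_files_raw <- function(paths) {
  contents <- lapply(paths, function(p) {
    readBin(p, what = "raw", n = file.size(p))
  })
  names(contents) <- basename(paths)
  contents
}

# Hypothetical helper: download the zip, extract it into a temporary
# directory, and return the contents instead of the file paths.
download_and_read <- function(url) {
  zip_path <- tempfile(fileext = ".zip")
  download.file(url, zip_path, mode = "wb")
  # unzip() creates exdir if it does not exist yet.
  paths <- unzip(zip_path, exdir = tempfile("unzipped"))
  paths <- paths[!dir.exists(paths)] # keep files only, in case folders are listed
  read_files_raw(paths)
}

# Stand-in for the real work: report the size of one file in bytes.
do_stuff <- function(content) {
  length(content)
}

list(
  tar_target(
    url,
    "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
    format = "url"
  ),
  # An ordinary (non-file) target, so it can be branched over directly.
  # iteration = "list" gives each branch one element via [[ ]].
  tar_target(contents, download_and_read(url), iteration = "list"),
  tar_target(stuff_done, do_stuff(contents), pattern = map(contents))
)
```

Because contents is stored in the targets data store rather than tracked as external files, there is no need for the extra file_names and files targets, and no "data" directory has to exist beforehand.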