R {drake} plan: Read many datasets into single target

Question

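I am starting to use {drake} for a data production pipeline. The raw data I work with is very large and is split into about 130 separate (Stata) files, so each file should be processed separately. To keep things readable, I use target(), its transform argument, and map() to specify my plan. It looks similar to the code below: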

plan <- drake_plan(
    dta_paths = list.files(my_folder, full.names = TRUE),
    dfs = target(
        read.dta13(dta_path),
        transform = map(dta_path = dta_paths)
    )
)

When I make() the plan, I get the following warnings and error:

target dfs_dta_paths

Warning: target dfs_dta_paths warnings:

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

fail dfs_dta_paths

Error: Target dfs_dta_paths failed. Call diagnose(dfs_dta_paths) for details. Error message:

Expecting a single string value: [type=character; extent=129].

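From what I understand of these warnings and the error message, the mapping over the individual file paths does not work, and the complete vector is passed to the first function call. I read https://books.ropensci.org/drake/static.html#map but it did not help me solve the problem. Converting the path vector to a list did not help either.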

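I also came across the idea of a predefined grid, which does work as suggested. But since I only need a single vector, not a complex grid, it looks like over-engineering to me.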
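For reference, the predefined-grid variant mentioned above might look roughly like the sketch below. It assumes a one-column data frame passed to map() through `.data`; the name `path_grid` and the two file paths are hypothetical and serve only as illustration.

```r
library(drake)

# Hypothetical paths for illustration; the files need not exist to build the plan.
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

# A one-column grid: each row becomes one target.
path_grid <- data.frame(dta_path = dta_paths, stringsAsFactors = FALSE)

plan <- drake_plan(
  dfs = target(
    read.dta13(file_in(dta_path)),
    # !! splices the actual data frame into the transform.
    transform = map(.data = !!path_grid)
  )
)

plan
```

With only one column in the grid, this gives the same one-target-per-file layout as mapping over the vector itself.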

I feel like I am missing something obvious, but I cannot spot it. Any idea what is wrong with my code?

I am aware of https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets, but since I want to iterate on the data cleaning process, I think creating the dfs targets as shown above would help.

Answer 1

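When you use target(transform = ...), it is a good idea to inspect and visualize the plan before handing it to make(). It may take a few iterations to get it right. Here is your current plan.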

library(drake)
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)

plan
#> # A tibble: 2 x 2
#>   target        command                                 
#>   <chr>         <expr>                                  
#> 1 dta_paths     list.files(my_folder, full.names = TRUE)
#> 2 dfs_dta_paths read.dta13(dta_paths)

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2020-01-16 by the reprex package (v0.3.0)

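The printed plan shows that map() treated dta_paths as a literal symbol rather than as the vector it will eventually hold, so only one target, dfs_dta_paths, was generated. To read one file per target, I recommend the plan below. For more on using !!, see https://books.ropensci.org/drake/static.html#tidy-evaluation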

library(drake)

# create some faux stata files for the example.
my_folder <- fs::dir_create("folder")
file.create("folder/file1.dta")
#> [1] TRUE
file.create("folder/file2.dta")
#> [1] TRUE

# Since you are using static branching (https://books.ropensci.org/drake/static.html)
# this needs to be defined up front.
# It does not need to be a target, re https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets
dta_paths <- list.files(my_folder, full.names = TRUE)

plan <- drake_plan(
  dfs = target(
    # Use !! here to literally insert the path so file_in() can mark it for tracking.
    read.dta13(file_in(!!dta_path)),
    # Use !! here to insert the actual vector of paths instead of the symbol `dta_paths`
    transform = map(dta_path = !!dta_paths)
  )
)

plan
#> # A tibble: 2 x 2
#>   target               command                                
#>   <chr>                <expr>                                 
#> 1 dfs_folder.file1.dta read.dta13(file_in("folder/file1.dta"))
#> 2 dfs_folder.file2.dta read.dta13(file_in("folder/file2.dta"))

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2020-01-16 by the reprex package (v0.3.0)
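If the per-file data frames eventually need to end up in one target, as the title suggests, drake's combine() transform can gather them. The following is a minimal sketch and not part of the plan above: the target name `all_dfs` and the use of dplyr::bind_rows() are assumptions chosen for illustration, and the two paths are hypothetical.

```r
library(drake)

# Hypothetical paths for illustration; the files need not exist to build the plan.
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

plan <- drake_plan(
  # One target per file, as in the plan above.
  dfs = target(
    read.dta13(file_in(!!dta_path)),
    transform = map(dta_path = !!dta_paths)
  ),
  # Assumed downstream step: combine() passes every dfs_* target to one command.
  all_dfs = target(
    dplyr::bind_rows(dfs),
    transform = combine(dfs)
  )
)

plan
```

Keeping the per-file targets separate means that a change to one file should only invalidate that file's target and the combined target, which fits iterating on the cleaning step.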
