R {drake} plan: Read many datasets into single target
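I have started using {drake} for a data production pipeline. The raw data I work with is very large and is split into about 130 separate (Stata) files, so each file should be processed separately. To keep the plan readable, I use target(), transform() and map() to specify it. It looks similar to the code below: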
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)
When I make() the plan, I get the following error:
target dfs_dta_paths
Warning: target dfs_dta_paths warnings:
the condition has length > 1 and only the first element will be used
the condition has length > 1 and only the first element will be used
the condition has length > 1 and only the first element will be used
fail dfs_dta_paths
Error: Target dfs_dta_paths failed. Call diagnose(dfs_dta_paths) for details. Error message:
  Expecting a single string value: [type=character; extent=129].
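As far as I can tell from this warning and error message, the mapping over the different file paths does not work, and the complete vector is passed to the first function call. I read https://books.ropensci.org/drake/static.html#map but it did not help me solve the problem. Converting the vector of paths to a list did not help either.
I also came across the idea of a predefined grid, which actually works as suggested. But since I only need a vector, not a complex grid, this looks like over-engineering to me.
I feel like I am missing something obvious, but I cannot spot it. Any idea what is wrong with my code?
I am aware of https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets, but since I want to iterate on the data cleaning process, I thought that creating the dfs targets as shown above would help.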
When you use target(transform = ...), it is a good idea to visualize the plan before handing it to make(). It can take a few iterations to get it right. Here is your current plan.
library(drake)
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)
plan
#> # A tibble: 2 x 2
#>   target        command
#>   <chr>         <expr>
#> 1 dta_paths     list.files(my_folder, full.names = TRUE)
#> 2 dfs_dta_paths read.dta13(dta_paths)
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2020-01-16 by the reprex package (v0.3.0)
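The printed plan above hints at the cause: static branching expands targets when the plan is created, before any target has been built, so the bare symbol dta_paths is treated as a single grouping value rather than as the vector it will later hold. A minimal sketch of the difference, assuming two placeholder file names ("a.dta" and "b.dta" are hypothetical and do not need to exist, because creating a plan does not run its commands):

```r
library(drake)

# A bare symbol is one grouping value, so map() generates a single target
# whose command receives the whole (future) vector.
plan_symbol <- drake_plan(
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)

# Literal values are known at plan-creation time, so map() generates
# one target per value. The file names here are hypothetical placeholders.
plan_literal <- drake_plan(
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = c("a.dta", "b.dta"))
  )
)

nrow(plan_symbol)
nrow(plan_literal)
```

Comparing the row counts of the two plans is a quick way to check whether map() branched as intended before calling make().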
To read one file per target, I recommend the plan below. For more on using !!, see https://books.ropensci.org/drake/static.html#tidy-evaluation.
library(drake)
# Create some faux Stata files for the example.
my_folder <- fs::dir_create("folder")
file.create("folder/file1.dta")
#> [1] TRUE
file.create("folder/file2.dta")
#> [1] TRUE
# Since you are using static branching (https://books.ropensci.org/drake/static.html),
# this needs to be defined up front.
# It does not need to be a target, re https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets
dta_paths <- list.files(my_folder, full.names = TRUE)
plan <- drake_plan(
  dfs = target(
    # Use !! here to literally insert the path so file_in() can mark it for tracking.
    read.dta13(file_in(!!dta_path)),
    # Use !! here to insert the actual vector of paths instead of the symbol `dta_paths`.
    transform = map(dta_path = !!dta_paths)
  )
)
plan
#> # A tibble: 2 x 2
#>   target               command
#>   <chr>                <expr>
#> 1 dfs_folder.file1.dta read.dta13(file_in("folder/file1.dta"))
#> 2 dfs_folder.file2.dta read.dta13(file_in("folder/file2.dta"))
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2020-01-16 by the reprex package (v0.3.0)
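If the per-file data frames eventually need to end up in one target, as the question title suggests, a combine() transform can gather the mapped targets. The following is only a sketch under assumptions: the target name all_dfs is hypothetical, dplyr::bind_rows() is just one possible way to stack the data frames (it assumes the files have compatible columns), and the paths are written out as placeholders instead of being listed from disk.

```r
library(drake)

# Hypothetical placeholder paths; in practice this would be
# list.files(my_folder, full.names = TRUE) as in the plan above.
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

plan <- drake_plan(
  # One target per file, exactly as in the recommended plan.
  dfs = target(
    read.dta13(file_in(!!dta_path)),
    transform = map(dta_path = !!dta_paths)
  ),
  # Assumed follow-up step: gather all dfs_* targets into one data frame.
  all_dfs = target(
    dplyr::bind_rows(dfs),
    transform = combine(dfs)
  )
)

plan
```

With this layout each file is still read in its own target, so changing one file only invalidates that target and the combined one, which may suit iterative data cleaning.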