Run R Markdown on many different datasets and save each knitted Word document separately

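I created an R Markdown document that checks a series of datasets for errors (for example, are there blanks in a given column? If so, print a statement saying that there are NAs and which rows contain them). I have set the R Markdown to output bookdown::word_document2. I have about 100 datasets that I need to run through the same R Markdown, producing a separate Word document for each one.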

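Is there a way to run the same R Markdown across all of the datasets and get a new Word document for each one (so that they are not overwritten)? All of the datasets are in the same directory. I know that the output is overwritten every time the document is knitted, so I need to be able to save each Word document under the name of its dataset/file.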

Minimal Example

Create a directory containing 3 .xlsx files

library(openxlsx)

setwd("~/Desktop")
dir.create("data")

dataset <-
  structure(
    list(
      name = c("Andrew", "Max", "Sylvia", NA, "1"),
      number = c(1, 2, 2, NA, NA),
      category = c("cool", "amazing",
                   "wonderful", "okay", NA)
    ),
    class = "data.frame",
    row.names = c(NA,-5L)
  )

write.xlsx(dataset, './data/test.xlsx')
write.xlsx(dataset, './data/dataset.xlsx')
write.xlsx(dataset, './data/another.xlsx')

RMarkdown

---
title: Hello_World
author: "Somebody"
output:
  bookdown::word_document2:
    fig_caption: yes
    number_sections: FALSE

---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

setwd("~/Desktop")

library(openxlsx)

# Load data for one .xlsx file. The other datasets are all in "/data". 
dataset <- openxlsx::read.xlsx("./data/test.xlsx")

```    

# Test for Errors

```{r test, echo=FALSE, comment=NA}

# Are there any NA values in the column?
suppressWarnings(if (TRUE %in% is.na(dataset$name)) {
  na.index <- which(is.na(dataset$name))
  cat(
    paste(
      "– There are NAs/blanks in the name column. There should be no blanks in this column. The following row numbers in this column need to be corrected:",
      paste(na.index, collapse = ', ')
    ),
    ".",
    sep = "",
    "\n",
    "\n"
  )
})

```

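So I would run this R Markdown on the first .xlsx dataset in the /data directory (test.xlsx) and save the Word document. I would then like to do the same for every other dataset in the directory (i.e., list.files(path = "./data")) and save a new Word document each time. The only thing that changes in each run is this line: dataset <- openxlsx::read.xlsx("./data/test.xlsx"). I know that I need to set up some parameters that I can use in rmarkdown::render, but I am not sure how.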

I have looked at some other SO posts (e.g., How to combine two RMarkdown (.Rmd) files into a single output?), but most focus on combining .Rmd files rather than running different iterations of the same file.

I also tried the following. All of the additions here are made to the example R Markdown above.

Add this to the YAML header:

params:
  directory:
    value: x

Add this to the setup code chunk:

# Pull in the data
dataset <- openxlsx::read.xlsx(file.path(params$directory))

Then, finally, I run the following code to render the document.

rmarkdown::render(
    input  = 'Hello_World.Rmd'
    , params = list(
        directory = "./data"
    )
)

However, I get the following error, even though I only have .xlsx files in /data:

Quitting from lines 14-24 (Hello_World.Rmd) Error: openxlsx can only read .xlsx files

I also tried this on my full .Rmd file and got the following error, even though the paths are exactly the same.

Quitting from lines 14-24 (Hello_World.Rmd) Error in file(con, "rb") : cannot open the connection

*Note: lines 14-24 are essentially the setup section of the .Rmd.

I am not sure what I need to change. I also need the output files to be named after the original files (e.g., "test" for test.xlsx, "another" for another.xlsx, etc.).

You can call render in a loop, passing each file to the document as a parameter:

dir_in <- 'data'
dir_out <- 'result'

files <- file.path(getwd(),dir_in,list.files(dir_in))

for (file in files) {
  print(file)
  rmarkdown::render(
    input  = 'Hello_World.Rmd',
    output_file = tools::file_path_sans_ext(basename(file)),
    output_dir = dir_out,
    params = list(file = file)
  )
}
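
The loop above derives each output name from the input file name. A minimal sketch of that step on its own, using only base R and the tools package, might look like this. The temporary directory, the extra notes.txt file, and the pattern filter are assumptions added for illustration and are not part of the original answer:

```r
# Build a throwaway input directory so the sketch is self-contained (illustrative only).
dir_in <- file.path(tempdir(), "data_sketch")
dir.create(dir_in, showWarnings = FALSE)
invisible(file.create(
  file.path(dir_in, c("test.xlsx", "dataset.xlsx", "another.xlsx", "notes.txt"))
))

# full.names = TRUE returns usable paths; the pattern keeps only .xlsx files.
files <- list.files(dir_in, pattern = "\\.xlsx$", full.names = TRUE)

# Strip the directory and the extension to get one output name per dataset.
output_names <- tools::file_path_sans_ext(basename(files))
print(output_names)
```

Absolute paths are the safer choice here, because knitting normally runs with the working directory set to the folder that contains the .Rmd file.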

R Markdown:

---
title: Hello_World
author: "Somebody"
output:
  bookdown::word_document2:
    fig_caption: yes
    number_sections: FALSE
params: 
  file: ""
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(openxlsx)

# Load the .xlsx file passed in through the `file` parameter.
dataset <- openxlsx::read.xlsx(params$file)

```    

# Test for Errors

```{r test, echo=FALSE, comment=NA}

# Are there any NA values in the column?
suppressWarnings(if (TRUE %in% is.na(dataset$name)) {
  na.index <- which(is.na(dataset$name))
  cat(
    paste(
      "– There are NAs/blanks in the name column. There should be no blanks in this column. The following row numbers in this column need to be corrected:",
      paste(na.index, collapse = ', ')
    ),
    ".",
    sep = "",
    "\n",
    "\n"
  )
})

```
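
The first error in the question ("openxlsx can only read .xlsx files") most likely comes from passing the directory ./data rather than a single file. A small guard in the setup chunk could make that kind of failure more obvious. This is only a sketch in base R, and the helper name check_input_file is hypothetical:

```r
# Hypothetical helper: fail early with a clear message if the path is not a single .xlsx file.
check_input_file <- function(path) {
  if (dir.exists(path)) {
    stop("Expected a file but got a directory: ", path)
  }
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (tolower(tools::file_ext(path)) != "xlsx") {
    stop("Expected an .xlsx file: ", path)
  }
  invisible(path)
}

# Demonstration on a temporary directory and an empty placeholder file.
demo_dir <- file.path(tempdir(), "guard_sketch")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "test.xlsx")
invisible(file.create(demo_file))

check_input_file(demo_file)  # a file with the right extension passes silently

# A directory is rejected; capture the message instead of stopping the script.
result <- tryCatch(check_input_file(demo_dir),
                   error = function(e) conditionMessage(e))
print(result)
```

In the .Rmd this would be called on params$file just before read.xlsx.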

An alternative that uses purrr instead of a for loop, with exactly the same setup as @Waldi.

Render

dir_in <- 'data'
dir_out <- 'result'

files <- file.path(getwd(),dir_in,list.files(dir_in))

purrr::map(.x = files, .f = function(file){ 
  rmarkdown::render(
    input  = 'Hello_World.Rmd',
    output_file = tools::file_path_sans_ext(basename(file)),
    output_dir = dir_out,
    params = list(file = file)
  )
})
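
With around 100 datasets, one unreadable file would stop the whole batch. One possible way to keep going is to wrap each call in tryCatch and collect the outcome. In this sketch render_one is a hypothetical stand-in for the rmarkdown::render call above, so that the example runs without any packages, and the file names are made up:

```r
# Hypothetical stand-in for the rmarkdown::render() call; fails on purpose for one input.
render_one <- function(file) {
  if (grepl("broken", file)) stop("cannot read ", file)
  paste0(tools::file_path_sans_ext(basename(file)), ".docx")
}

# Run one file and record either the result or the error message instead of stopping.
safe_render <- function(file) {
  tryCatch(
    list(file = file, ok = TRUE, detail = render_one(file)),
    error = function(e) list(file = file, ok = FALSE, detail = conditionMessage(e))
  )
}

files <- c("data/test.xlsx", "data/broken.xlsx", "data/another.xlsx")
results <- lapply(files, safe_render)

# List the inputs that failed so they can be fixed and re-run.
failed <- files[!vapply(results, function(r) r$ok, logical(1))]
print(failed)
```

purrr::possibly and purrr::safely offer a similar pattern for the map version.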

Rmarkdown

---
title: Hello_World
author: "Somebody"
output:
  bookdown::word_document2:
    fig_caption: yes
    number_sections: FALSE
params: 
  file: ""
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(openxlsx)

# Load the .xlsx file passed in through the `file` parameter.
dataset <- openxlsx::read.xlsx(params$file)

```    

# Test for Errors

```{r test, echo=FALSE, comment=NA}

# Are there any NA values in the column?
suppressWarnings(if (TRUE %in% is.na(dataset$name)) {
  na.index <- which(is.na(dataset$name))
  cat(
    paste(
      "– There are NAs/blanks in the name column. There should be no blanks in this column. The following row numbers in this column need to be corrected:",
      paste(na.index, collapse = ', ')
    ),
    ".",
    sep = "",
    "\n",
    "\n"
  )
})

```
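
The check in the test chunk covers a single column. If the same report is needed for several columns, the logic could be factored into a function. A sketch in base R, where report_na is a hypothetical name and the data frame mirrors the example data from the question:

```r
# Hypothetical helper: return a message listing rows with NA in one column, or NULL if none.
report_na <- function(data, column) {
  na_index <- which(is.na(data[[column]]))
  if (length(na_index) == 0) {
    return(NULL)
  }
  paste0(
    "There are NAs/blanks in the ", column, " column. ",
    "The following row numbers need to be corrected: ",
    paste(na_index, collapse = ", "), "."
  )
}

# Same shape as the example data in the question.
dataset <- data.frame(
  name = c("Andrew", "Max", "Sylvia", NA, "1"),
  number = c(1, 2, 2, NA, NA),
  category = c("cool", "amazing", "wonderful", "okay", NA)
)

# Apply the check to every column and print each non-empty message.
messages <- lapply(names(dataset), function(col) report_na(dataset, col))
for (msg in messages) {
  if (!is.null(msg)) cat(msg, "\n\n", sep = "")
}
```

Returning the message rather than printing it inside the helper keeps the check easy to test outside of knitting.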