Write a function in R to process docx files

Question

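I have a folder containing *.docx files. I would like to turn the script below into some kind of loop or function that reads all of the docx files, but I really don't know how to write an R function. Could someone please guide me?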

library(docxtractr)
real_world <- read_docx("C:/folder/doc1.docx")
docx_tbl_count(real_world)
tbls <- docx_extract_all_tbls(real_world)
a <- as.data.frame(tbls)

So ideally, it would append the new tables each time a new document is extracted.

Thanks, Peddie

Answer 1

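I don't know whether your code works as expected, but here I have turned it into a function with a path argument so that you can batch-process every docx file under that path (do not put a trailing slash at the end of the path). The argument defaults to ".", the current working directory: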

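library(docxtractr)

docxextr <- function(pathh = ".") {
    # match only .docx files; full.names = TRUE returns ready-to-use paths
    files <- list.files(path = pathh, pattern = "\\.docx$", full.names = TRUE)
    result <- list()
    for (filen in files) {
        real_world <- read_docx(filen)
        # use the table count to skip documents without tables
        if (docx_tbl_count(real_world) == 0) next
        tbls <- docx_extract_all_tbls(real_world)
        # as in the question; assumes one table per file (or tables with equal row counts)
        result[[basename(filen)]] <- as.data.frame(tbls)
    }
    # return after the loop: a return() inside it would stop after the first file
    result
}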
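
To get the single appended table the question asks for, one option is to tag each file's rows with the file name and row-bind everything. The following is a minimal base R sketch, not part of the original answer: `bind_tables` is a hypothetical helper, and the toy data frames stand in for tables extracted from two made-up files, so it runs without any .docx input.

```r
# Hypothetical helper (not from docxtractr): row-bind a named list of data
# frames whose columns may differ, filling absent columns with NA.
bind_tables <- function(tbl_list) {
    # union of all column names, in order of first appearance
    all_cols <- unique(unlist(lapply(tbl_list, names)))
    padded <- lapply(seq_along(tbl_list), function(i) {
        tbl <- as.data.frame(tbl_list[[i]], stringsAsFactors = FALSE)
        # add every column this table lacks
        for (col in setdiff(all_cols, names(tbl))) {
            tbl[[col]] <- rep(NA, nrow(tbl))
        }
        # record which list element (file) the rows came from
        cbind(source = rep(names(tbl_list)[i], nrow(tbl)),
              tbl[all_cols],
              stringsAsFactors = FALSE)
    })
    do.call(rbind, padded)
}

# Toy stand-ins for the tables of two files (made-up names and values)
toy <- list(
    "doc1.docx" = data.frame(name = c("a", "b"), value = c(1, 2)),
    "doc2.docx" = data.frame(name = "c", note = "x")
)

combined <- bind_tables(toy)
print(combined)
```

Keeping a `source` column makes it possible to trace each row back to its document after the tables have been merged. The sketch assumes the input list is named and that no table already has a column called `source`.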

Answer 2

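Edit: For this answer I assume that the OP did not use the term "function" in the sense of an R function. I think the OP is simply looking for an algorithm that solves the problem.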

#### load packages ####
library(docxtractr)
library(plyr)

#### load data ####
# define path of dir
pathto <- "stackoverflow/41251392/example/"
# get path of every .docx-file in dir
# (pattern is a regular expression, not a glob, so anchor it on the extension)
filelist <- list.files(path = pathto, pattern = "\\.docx$", full.names = TRUE)
# read every file with docxtractr::read_docx()
tablelist <- lapply(filelist, read_docx)
# extract every table from every file with docxtractr::docx_extract_all_tbls()
tables <- lapply(tablelist, docx_extract_all_tbls)

#### append data to create one data.frame #### 
# combine extracted tables with plyr::ldply()
ldply(lapply(tables, function(x) {ldply(x, data.frame)}), data.frame)

The last line is a bit hard to follow. Have a look at ?plyr::ldply.
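
To make the nested call easier to follow, here is a small sketch that runs the same two `ldply()` steps separately. It is only an illustration: `toy_tables` is made-up data with the same shape as `tables` (one list element per file, each holding that file's tables), so no .docx files are needed.

```r
library(plyr)

# Toy stand-in for `tables`: two "files", the first with two tables,
# the second with one table that has a different column.
toy_tables <- list(
    list(data.frame(id = 1:2, x = c("a", "b")),
         data.frame(id = 3L, x = "c")),
    list(data.frame(id = 4L, y = TRUE))
)

# Inner ldply: stack the tables of one file into a single data frame.
per_file <- lapply(toy_tables, function(x) ldply(x, data.frame))

# Outer ldply: stack the per-file data frames into one; columns that a
# file does not have are filled with NA.
combined <- ldply(per_file, data.frame)
print(combined)
```

Both levels are row-binds: the inner one works within a file, the outer one across files. Columns are matched by name, so tables with different layouts end up side by side with NA in the gaps.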
