How to write an R function that reads in a directory of text files, changes them, and saves into the same directory

I tried looking for an answer elsewhere without success, so here goes:

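I have a directory of .txt files containing natural language (legislation). I want to read in the directory, perform some string manipulations on each file with a new function, and save the files in the same directory under new file names.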

Example data (note the "Commencement Information" lines at the end):

example <- "(7) A person who is not a UK national commits an offence under this section if 
            — (a) any part of the arranging or facilitating takes place in the United 
            Kingdom, or (b) the travel consists of arrival in or entry into, departure
            from, or travel within, the United Kingdom. Commencement Information I2 S. 2 
            in force at 31.7.2015 by S.I. 2015/1476, reg. 2(a) (with regs. 6-8)"

I've sorted out reading in the directory (with the help of this answer), which works fine:

files <- list.files(path = "../txt_copies/", pattern = "\\.txt$", all.files = TRUE, full.names = TRUE) # filenames (pattern is a regex, not a glob)
filelist <- lapply(files, read.delim) # read in files from the filenames list
names(filelist) <- basename(tools::file_path_sans_ext(files)) # name list elements by filenames
list2env(filelist, envir = .GlobalEnv) # move the list elements into the global env as objects

The problem is with my function:

library(stringr) # provides str_replace_all()

page_cleaner <- function(x) {

  txt <- x

  # clean text and print confirmation
  # (backslashes must be doubled inside R string literals)
  txt <- str_replace_all(txt, "(F)(\\d).+?(\\n)", "")
  txt <- gsub("Textual Amendments", "", txt)
  print("Text cleaned of Textual Amendments")
  txt <- str_replace_all(txt, "(Commencement Information).+?(\\d)\\)", "")
  txt <- str_replace_all(txt, "(Commencement Information).+?(\\w)\\)", "")
  print("Text cleaned of Commencement Information")

  x <- txt

}

lapply(names(filelist), page_cleaner)

It should return:

[1] "(7) A person who is not a UK national commits an offence under this section if
     — (a) any part of the arranging or facilitating takes place in the United Kingdom, 
     or (b) the travel consists of arrival in or entry into, departure from, or travel 
     within, the United Kingdom."

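When I call the function on its own, it seems to work fine on the example, e.g. page_cleaner(example), but not on the list of files.

I was fairly sure this would work, but I can't see where it's going wrong. The string manipulations work fine outside the function. On the example data, they should remove everything from "Commencement Information" to the end of the string.

I already have a way to save the objects to the directory once everything is done, so no help is needed there.

Thanks!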

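Call lapply on the list itself, not on its names. names(filelist) is just a character vector of file names, so page_cleaner was being handed the names rather than the text. So, change this line:

lapply(names(filelist), page_cleaner)

to:

lapply(filelist, page_cleaner)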
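
To make the difference concrete, here is a minimal sketch in base R. The list contents, the element names act_1 and act_2, and the helper strip_commencement are all made up for illustration. The helper is a simplified stand-in for page_cleaner, not the original function.

```r
# Hypothetical stand-in for the real list of file contents.
filelist <- list(
  act_1 = "Some provision. Commencement Information I1 S. 1 in force (with regs. 6-8)",
  act_2 = "Another provision. Commencement Information I2 S. 2 in force (with regs. 1-3)"
)

# Simplified cleaner (base R only, an assumption for this sketch): drop
# everything from the marker up to the first digit followed by ")".
strip_commencement <- function(txt) {
  gsub("Commencement Information.+?\\d\\)", "", txt, perl = TRUE)
}

# Iterating over names() passes the strings "act_1" and "act_2",
# so there is nothing to clean and the names come back unchanged.
from_names <- lapply(names(filelist), strip_commencement)

# Iterating over the list passes the actual text of each element.
from_list <- lapply(filelist, strip_commencement)

print(from_names)
print(from_list)
```

Note that lapply returns a new list and leaves filelist untouched, so assign the result (as from_list does here) if you want to keep the cleaned text.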

Additional suggestion:

You can simply use stringr::str_remove and stringr::str_remove_all to remove the matched strings, instead of replacing them with "".

library(stringr) # provides str_remove(), str_remove_all() and the %>% pipe

page_cleaner <- function(txt) {

  # clean text and print confirmation
  # (backslashes must be doubled inside R string literals)
  txt <- txt %>%
    str_remove_all("(F)(\\d).+?(\\n)") %>%
    str_remove("Textual Amendments")
  print("Text cleaned of Textual Amendments")

  txt <- txt %>%
    str_remove_all("(Commencement Information).+?(\\d)\\)")
  print("Text cleaned of Commencement Information")

  return(txt)

}
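
For completeness, here is a rough end-to-end sketch of the read, clean, save loop from the title, using base R only. Several things here are assumptions and not taken from the question: the temporary directory dir_path, the demo file section_2.txt, the simplified cleaner clean_text, and the "_clean" file name suffix. It also swaps in readLines for read.delim, because read.delim returns a data frame, not a character string, and the string functions expect text. The (?s) flag lets the dot match newlines, which matters once a whole file is held in one string.

```r
# Work in a throwaway directory so the sketch does not touch real files.
dir_path <- file.path(tempdir(), "txt_copies_demo")  # hypothetical directory
dir.create(dir_path, showWarnings = FALSE)
writeLines(
  c("within, the United Kingdom. Commencement Information I2 S. 2",
    "in force at 31.7.2015 (with regs. 6-8)"),
  file.path(dir_path, "section_2.txt")
)

# Simplified base-R cleaner standing in for page_cleaner().
# (?s) makes "." match newlines, so the match can run across lines.
clean_text <- function(txt) {
  gsub("(?s)Commencement Information.+?\\d\\)", "", txt, perl = TRUE)
}

files <- list.files(dir_path, pattern = "\\.txt$", full.names = TRUE)

for (f in files) {
  # Read the whole file into a single string.
  txt <- paste(readLines(f, warn = FALSE), collapse = "\n")
  # Build the new name next to the original, e.g. section_2_clean.txt.
  out <- file.path(
    dirname(f),
    paste0(tools::file_path_sans_ext(basename(f)), "_clean.txt")
  )
  writeLines(clean_text(txt), out)
}
```

Because the cleaned copies land in the same directory, a second run would pick them up as input too, so you may want to filter them out of files before looping again.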