Read multiple local HTML files in a folder in R

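I have several HTML files in a folder on my computer. I would like to read them into R, keeping the original formatting as much as possible. By the way, they contain only text. I tried two approaches, but both failed: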

## first approach
library(tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))

## second approach
list_files_path <- list.files(path = './gazzetes.presihtml')
a <- paste0(list_files_path, names) # the vector names contains the file names with the .HTML extension
rawHTML <- readLines(a)

Any ideas? All the best.

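Your second approach is close to working. The problem is that readLines accepts only a single connection, but you are giving it a vector containing multiple files. You can use lapply together with readLines to get around this. Here is an example: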

# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')

# readLines for each file and put them in a list
lineList <- lapply(files, readLines)

# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)

# collapse the character vector into a single string
html <- paste(lineVector , collapse = '\n')

# print the string with original formatting
cat(html)
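
If you would rather not type the file paths by hand, the same idea can be combined with list.files. Note that list.files returns bare file names unless full.names = TRUE is set, which may be part of why the paths in the question did not resolve. The following is only a minimal sketch: the helper name read_html_folder, the demo folder and the demo file contents are all hypothetical and not from the question, and UTF-8 encoding is an assumption. It keeps one string per file instead of merging everything into a single string.

```r
# Hypothetical helper (not from the original post): read every HTML file
# in a folder and return a named list with one string per file.
read_html_folder <- function(dir) {
  # full.names = TRUE keeps the directory in each path so readLines can find the files;
  # the pattern matches .htm and .html, and ignore.case also accepts .HTML
  files <- list.files(path = dir, pattern = "\\.html?$",
                      full.names = TRUE, ignore.case = TRUE)

  # readLines takes one file at a time, so call it once per file,
  # then collapse that file's lines into a single string
  docs <- lapply(files, function(f) {
    paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
  })

  # name each element after its file so the documents stay distinguishable
  names(docs) <- basename(files)
  docs
}

# Demo on a temporary folder with two made-up files
demo_dir <- file.path(tempdir(), "html_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("<html>", "<body>Hello</body>", "</html>"), file.path(demo_dir, "a.html"))
writeLines("<p>Second file</p>", file.path(demo_dir, "b.HTML"))

docs <- read_html_folder(demo_dir)

# print one document with its original line breaks
cat(docs[["a.html"]])
```

For the folder in the question, the call would presumably look like read_html_folder('./gazzetes.presihtml'), provided the working directory is the parent of that folder.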