How to write an R script to extract URLs from an HTML table

Question

I am trying to extract every URL, such as "https://....zip", from the elements of this page: https://divvy-tripdata.s3.amazonaws.com/index.html, using the rvest library as follows:

link <- "https://divvy-tripdata.s3.amazonaws.com/index.html"

library(rvest)
library(xml2)

html <- read_html(link)

html %>% html_attrs("href")

Output:

> html %>% html_attrs("href")
Error in html_attrs(., "href") : unused argument ("href")

Can you help me extract all the URLs from the link above using R?

HTML: https://i.stack.imgur.com/5BiFU.jpg

Answer 1

A base R solution that uses the URL one level up (the bucket root rather than index.html) to read and parse the XML:

# Store the URL to be scraped: base_url => character scalar
base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# Resolve the zip URLs: zip_urls => character vector
zip_urls <- paste(
  base_url,
  gsub(
    ">(.*?)<\\/",           # strip the leftover ">" and "</" around each file name
    "\\1",
    grep(
      "\\.zip",             # keep only the pieces that name a .zip file
      strsplit(
        readLines(base_url),
        "\\<Key\\>")[[2]],  # split on the word "Key"; [[2]] takes the second line read
      value = TRUE
    )
  ),
  sep = "/"
)
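
To make the nested call easier to follow, here is a minimal sketch of the same steps unrolled one at a time. It runs on an inline sample string instead of the live bucket, so sample_line and its contents are assumptions made up for illustration, not the real listing.

```r
# Hypothetical sample standing in for the listing line that readLines() returns
sample_line <- paste0(
  "<ListBucketResult><Name>divvy-tripdata</Name>",
  "<Contents><Key>202004-divvy-tripdata.zip</Key><Size>1</Size></Contents>",
  "<Contents><Key>index.html</Key><Size>2</Size></Contents>",
  "</ListBucketResult>"
)

# Step 1: split on the word "Key" (\\< and \\> are word boundaries)
pieces <- strsplit(sample_line, "\\<Key\\>")[[1]]

# Step 2: keep only the pieces that mention a .zip file
zip_pieces <- grep("\\.zip", pieces, value = TRUE)

# Step 3: drop the leftover ">" and "</" around each file name
zip_names <- gsub(">(.*?)<\\/", "\\1", zip_pieces)

# Step 4: prepend the base URL
zip_urls <- paste("https://divvy-tripdata.s3.amazonaws.com", zip_names, sep = "/")
print(zip_urls)
```

Splitting on the tag name rather than parsing the XML keeps this dependency-free, but it relies on the listing keeping its current shape, so treat it as a sketch rather than a robust parser.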

Answer 2

The links come from an additional GET request made by the browser, which returns XML. You can still use rvest: grab the Key nodes, then complete the URLs.

library(rvest)

base_url <- "https://divvy-tripdata.s3.amazonaws.com"
files <- read_html(base_url) |> html_elements('key') |> html_text() |> url_absolute(base_url)

For older R versions (before 4.1, which introduced the native pipe), replace |> with %>% and add library(magrittr) as an import.
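
As an offline illustration of the same idea, the sketch below parses an inline XML string instead of fetching the bucket, and adds a filter so that only .zip entries are kept. The sample_xml content and the filtering step are assumptions added for illustration; they are not part of the answer above.

```r
library(rvest)

# Hypothetical stand-in for the XML that the bucket root returns
sample_xml <- "<ListBucketResult>
  <Contents><Key>202004-divvy-tripdata.zip</Key></Contents>
  <Contents><Key>index.html</Key></Contents>
</ListBucketResult>"

base_url <- "https://divvy-tripdata.s3.amazonaws.com"

# read_html() uses an HTML parser, which lowercases tag names,
# so the <Key> nodes are selected as 'key'
keys <- read_html(sample_xml) %>% html_elements("key") %>% html_text()

# Keep only the .zip entries, then turn them into absolute URLs
zip_keys <- keys[grepl("\\.zip$", keys)]
zip_urls <- url_absolute(zip_keys, base_url)
print(zip_urls)
```

The %>% pipe is used here because rvest re-exports it, so the sketch should also run on R versions without the native pipe.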
