Extracting images from the web using rvest

Question

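I am trying to use R to extract player photos from the PGA website. Below is my attempt to get the image URLs, but the resulting links either do not display an image or show a blank one.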

# Install pacman if it is missing, then load the packages
if (!require(pacman)) install.packages("pacman")
pacman::p_load('rvest', 'stringi', 'dplyr', 'tidyr', 'measurements', 'reshape2', 'foreach', 'doParallel', 'curl', 'httr', 'Iso', 'janitor')

PGA_url <- "https://www.pgatour.com"
pga_web <- read_html(paste0(PGA_url, '/players.html'))
plyers_photo <- pga_web %>%
  html_nodes("[class='player-card']") %>%
  html_nodes('div.player-image-wrapper') %>%
  html_nodes('img') %>%
  html_attr('src')

Can someone tell me what I am doing wrong?

Answer 1

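If you inspect the page source, you will see that you are retrieving exactly what the static source contains, i.e. the default img value in src. Looking across the same tag, you may notice an adjacent data-src attribute that holds an alternative png ending matching the regular expression headshots_\d{5}\.png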

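When JavaScript runs in the browser (which does not happen with the plain HTTP request that rvest makes), these URLs are updated dynamically: the default png ending is replaced with the one from the data-src attribute.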
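To see the difference without hitting the site, here is a minimal sketch on an inline snippet. The markup, the example.com host, and the 12345 ID are made up for illustration; only the class names and the src / data-src attribute names come from the question and the description above, and the real page may well differ.

```r
library(rvest)

# Hypothetical markup modelled on the structure described above
html <- '
<div class="player-card">
  <div class="player-image-wrapper">
    <img src="https://example.com/image/upload/headshots_default.png"
         data-src="headshots_12345.png">
  </div>
</div>
<script>
  document.querySelector("img").src =
    "https://example.com/image/upload/headshots_12345.png";
</script>'

# read_html() only parses the markup; the <script> is never executed
page <- read_html(html)
img  <- html_elements(page, ".player-card .player-image-wrapper img")

src      <- html_attr(img, "src")       # still the default placeholder
data_src <- html_attr(img, "data-src")  # the player-specific ending

# Check the ending against the pattern mentioned above
is_headshot <- grepl("headshots_\\d{5}\\.png$", data_src, perl = TRUE)
```

Here src keeps the default ending because nothing ran the script, while data_src carries the value you actually want.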

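You have two options. Either replace the ending you currently get with the value of that attribute, which gives the small, fixed-size images; or take the part of the URL up to and including upload as the base and append the extracted data-src value, which gives the large images.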
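The two options can be sketched with plain string handling. The base up to upload/ is the one used in this answer; the c_fill,w_100 sizing segment, the default file name, and the 12345 ID are assumptions made up for illustration, so check them against what the page really returns.

```r
# Hypothetical scraped values: a src with a sizing segment and a data-src ending
src      <- "https://pga-tour-res.cloudinary.com/image/upload/c_fill,w_100/headshots_default.png"
data_src <- "headshots_12345.png"

# Option 1 (small image): swap only the final path segment of src
small_url <- sub("[^/]+$", data_src, src)

# Option 2 (large image): keep everything up to and including "upload/",
# drop the rest, then append the data-src value
base_url  <- sub("(.*/upload/).*$", "\\1", src)
large_url <- paste0(base_url, data_src)
```

Both sub() calls are vectorised, so the same lines work unchanged on the full vectors returned by html_attr().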

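There is also no need for all those chained html_nodes() calls; a single call with an appropriate CSS selector list is enough. In addition, prefer the maintained html_elements() function over the older html_nodes():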

library(rvest)
library(magrittr)

PGA_url <- "https://www.pgatour.com"
pga_web <- read_html(paste0(PGA_url, "/players.html"))
# Base of the image URLs, up to and including "upload/"
placeholder_link <- 'https://pga-tour-res.cloudinary.com/image/upload/'

# One selector list replaces the chained calls; data-src holds the real ending
plyers_photo <- pga_web %>%
  html_elements(".player-card .player-image-wrapper img") %>%
  html_attr("data-src") %>%
  paste0(placeholder_link, .)
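One thing to watch for: html_attr() returns NA for any img without a data-src attribute, and paste0() would turn that into a URL ending in "NA". Below is a rough sketch of filtering those out and saving the files. The data_src vector and the download_photos() helper are hypothetical stand-ins, not part of rvest, and I am assuming the resulting URLs can be fetched directly.

```r
placeholder_link <- "https://pga-tour-res.cloudinary.com/image/upload/"

# Hypothetical result of html_attr("data-src"), including one missing value
data_src <- c("headshots_12345.png", NA, "headshots_67890.png")

# Drop missing values before building URLs
keep       <- !is.na(data_src)
photo_urls <- paste0(placeholder_link, data_src[keep])
dest_files <- file.path(tempdir(), basename(photo_urls))

# Hypothetical helper: download each image, reporting failures instead of stopping
download_photos <- function(urls, files) {
  for (i in seq_along(urls)) {
    tryCatch(
      # mode = "wb" keeps binary files intact on Windows
      download.file(urls[i], files[i], mode = "wb", quiet = TRUE),
      error = function(e) message("Skipped ", urls[i], ": ", conditionMessage(e))
    )
  }
  invisible(files)
}

# download_photos(photo_urls, dest_files)  # uncomment to actually fetch the images
```

The download call is left commented out so the snippet runs without network access.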
