Web scraping of nested links with R

Question

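I want to scrape the links nested in the property names. The script below runs, but it only retrieves NA for the URLs. Could you help me, or tell me what I am missing in the script snippet?

Thanks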

# Test
library(rvest)
library(dplyr)

link <- "https://www.sreality.cz/hledani/prodej/byty/brno?_escaped_fragment_="
page <- read_html(link)

price <- page %>% 
  html_elements(".norm-price.ng-binding") %>% 
  html_text()

name <- page %>% 
  html_elements(".name.ng-binding") %>% 
  html_text()

location <- page %>% 
  html_elements(".locality.ng-binding") %>% 
  html_text()

href <- page %>% 
  html_nodes(".name.ng-binding") %>% 
  html_attr("href") %>% paste("https://www.sreality.cz", ., sep="")

flat <- data.frame(price, name, location, href, stringsAsFactors = FALSE)

Answer 1

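Your CSS selector selects the inline HTML inside the anchor rather than the anchor itself, and that inner element has no href attribute. This should work: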

 page %>% 
     html_nodes("a.title") %>%
     html_attr("ng-href") %>% 
     paste0("https://www.sreality.cz", .)

paste0(...) is shorthand for paste(..., sep = '').
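
To see why the original selector produces NA, here is a minimal sketch on a hand-written HTML fragment. The markup is an assumption about how the page nests the name span inside the a.title anchor (the real page may differ), and the link path is made up for illustration.

```r
library(rvest)

# Hypothetical fragment imitating one listing: the span sits inside the anchor.
doc <- minimal_html('
  <a class="title" ng-href="/detail/prodej/byt/2+kk/brno/123">
    <span class="name ng-binding">Prodej bytu 2+kk</span>
  </a>')

# Selecting the span: it carries no link attribute, so html_attr() returns NA.
span_href <- doc %>%
  html_elements(".name.ng-binding") %>%
  html_attr("href")

# Selecting the anchor itself returns the relative link.
a_href <- doc %>%
  html_elements("a.title") %>%
  html_attr("ng-href")

# paste0() converts NA to the string "NA", which yields the broken URLs.
broken <- paste0("https://www.sreality.cz", span_href)
fixed  <- paste0("https://www.sreality.cz", a_href)
```

If the anchor class is not known, another option is to keep the original selector and move up to the enclosing element with xml2::xml_parent() before calling html_attr().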

Answer 2

Another way, using the JS path:

library(stringr)  # provides str_subset()

page %>% 
  html_nodes('#page-layout > div.content-cover > div.content-inner > div.transcluded-content.ng-scope > div > div > div.content > div > div:nth-child(4) > div > div:nth-child(n)') %>% 
  html_nodes('a') %>%
  html_attr('href') %>%
  str_subset('detail') %>%  # keep only links to listing detail pages
  unique() %>%
  paste("https://www.sreality.cz", ., sep="")

 [1] "https://www.sreality.cz/detail/prodej/byt/4+1/brno-zabrdovice-tkalcovska/1857071452"
 [2] "https://www.sreality.cz/detail/prodej/byt/3+kk/brno--/1336764508"
 [3] "https://www.sreality.cz/detail/prodej/byt/2+kk/brno-stary-liskovec-u-posty/3639359836"
 [4] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-reckovice-druzstevni/3845994844"
 [5] "https://www.sreality.cz/detail/prodej/byt/2+1/brno-styrice-jilova/1102981468"
 [6] "https://www.sreality.cz/detail/prodej/byt/1+kk/brno-dolni-herspice-/1961502812"
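
The filter-and-deduplicate step can be tried offline on a small fragment. This is only a sketch: the div.property class, the link paths, and the idea that each listing repeats its detail link (once on the image, once on the title) are assumptions made for illustration, not the site's confirmed markup.

```r
library(rvest)
library(stringr)

# Hypothetical listing markup: the first listing links to its detail page twice.
doc <- minimal_html('
  <div class="property">
    <a href="/detail/prodej/byt/2+kk/brno/111"><img src="a.jpg"></a>
    <a href="/detail/prodej/byt/2+kk/brno/111">Prodej bytu 2+kk</a>
    <a href="/adresar/some-agency">Agency</a>
  </div>
  <div class="property">
    <a href="/detail/prodej/byt/3+1/brno/222">Prodej bytu 3+1</a>
  </div>')

links <- doc %>%
  html_elements("div.property a") %>%
  html_attr("href") %>%
  str_subset("detail") %>%  # drop links that are not detail pages
  unique() %>%              # collapse the repeated image/title link
  paste0("https://www.sreality.cz", .)
```

A short class-based selector like this tends to survive layout changes better than a long positional path, though either one depends on the page structure staying the same.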

r

web-scraping

rvest