rvest handling hidden text

Question

When I scrape the web page, I don't see the data/text I'm looking for.

I tried googling the problem without success. I also tried using XPath, but I get {xml_nodeset (0)}.

library(rvest)
url <- "https://www.nasdaq.com/market-activity/ipos"
IPOS <- read_html(url)
# html_nodes() replaces xml_nodes(), which was removed in rvest 1.0.0
IPOS %>% html_nodes("tbody") %>% html_text()

Output:

[1] "\n            \n          \n          \n            \n          \n        "

I don't see any IPO data. The expected output should contain the table of "Priced" IPOs: symbol, company name, etc.

Answer 1

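The table data appears to be loaded by a script, so it is not in the static HTML that read_html() fetches. You can use the RSelenium package to get it.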

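library(rvest)
library(RSelenium)

# Start a Selenium server and a Firefox session
rD <- rsDriver(port = 1210L, browser = "firefox", check = FALSE)
remDr <- rD$client

url <- "https://www.nasdaq.com/market-activity/ipos"
remDr$navigate(url)

# Parse the rendered page source and extract every table on the page
IPOS <- remDr$getPageSource()[[1]] %>% 
  read_html() %>% 
  html_table(fill = TRUE)

str(IPOS)

# The "Priced" table was the third table when this answer was written
PRICED <- IPOS[[3]]

# Close the browser and stop the Selenium server when done
remDr$close()
rD$server$stop()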
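
The hard-coded index 3 depends on the page layout at the time. As a rough sketch, you could instead pick the table by one of its column names. The helper find_table_by_column() and the sample markup below are hypothetical, and the column name "Symbol" is an assumption about the rendered page, so check str(IPOS) first.

```r
library(rvest)

# Hypothetical helper: return the first table that has a column with the
# given name, or NULL if none of the tables has it.
find_table_by_column <- function(tables, column) {
  for (tbl in tables) {
    if (column %in% names(tbl)) return(tbl)
  }
  NULL
}

# Stand-in for remDr$getPageSource()[[1]]; the markup and the column
# names are made up for illustration.
page_source <- '<html><body>
<table><tr><th>Date</th><th>Event</th></tr>
<tr><td>2019-09-01</td><td>x</td></tr></table>
<table><tr><th>Symbol</th><th>Company Name</th></tr>
<tr><td>AAA</td><td>Alpha Corp</td></tr></table>
</body></html>'

tables <- page_source %>% read_html() %>% html_table()
priced <- find_table_by_column(tables, "Symbol")
```

With the live session you would pass IPOS (the list returned by html_table()) instead of tables.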

Answer 2

There is no need for the expensive RSelenium. In the browser's network tab you can find an API call that returns everything as JSON.

For example,

library(jsonlite)

data <- jsonlite::read_json('https://api.nasdaq.com/api/ipo/calendar?date=2019-09')

View(data$data$priced$rows)
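
By default read_json() returns nested lists, so the rows above come back as a list of records. If you would rather have a data frame, a minimal sketch is to let jsonlite simplify the result. The inline payload below is a made-up stand-in that only mirrors the data -> priced -> rows nesting; its field names are assumptions, not the real API fields.

```r
library(jsonlite)

# Hypothetical, trimmed-down payload mirroring the nesting used above;
# the field names "symbol" and "company" are illustrative only.
sample_json <- '{
  "data": {
    "priced": {
      "rows": [
        {"symbol": "AAA", "company": "Alpha Corp"},
        {"symbol": "BBB", "company": "Beta Inc"}
      ]
    }
  }
}'

# fromJSON() simplifies an array of records into a data frame by default
parsed <- jsonlite::fromJSON(sample_json)
priced <- parsed$data$priced$rows

str(priced)
```

For the live endpoint, read_json(url, simplifyVector = TRUE) should give the same kind of simplification, assuming the response keeps this structure.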

