How can I scrape the complete dataset from Yahoo Finance with rvest?

Question

I am trying to scrape the complete Bitcoin historical dataset from Yahoo Finance. This is my first attempt:

library(rvest)
library(tidyverse)

crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url,css = "table")
cryp_table <- html_table(cryp_table,fill = T) %>% 
  as.data.frame()

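The link I pass to read_html() already selects a long time period, but I only get the first 101 rows, and the last row is the loading message that appears as you keep scrolling. This is my second attempt, but I get the same result: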

col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- 
  col_page %>% 
  html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>% 
  html_table(fill = T)
cryp_final <- cryp_table[[1]]

How can I get the entire dataset?

Answer 1

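I think you can use the download link instead. If you inspect the page's network traffic, you will see the download link, which in this case is: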

"https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"

This link closely resembles the page URL, so we can transform the page URL into the download link and read the CSV. See the code:

library(stringr)
library(magrittr)

site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"

base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"

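# Strip everything up to "quote/", the "/history" path segment (keeping the
# "?" that starts the query string) and "&frequency=1d"; then rename
# "filter" to "events" and prepend the download endpoint.
download_link <- site %>% 
  stringr::str_remove_all(".+(?<=quote/)|/history(?=\\?)|&frequency=1d") %>% 
  stringr::str_replace("filter", "events") %>% 
  stringr::str_c(base_download, .)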

readr::read_csv(download_link)
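
The period1 and period2 values in that link are Unix timestamps (seconds since 1970-01-01 UTC), so the download link can also be assembled directly from a ticker and a date range instead of being derived from the page URL. The following is a minimal sketch in base R; the helper name build_download_url is hypothetical (it is not part of rvest or any Yahoo API), and it assumes the endpoint still accepts the same query parameters as the link above, which may not hold.

```r
# Hypothetical helper: assemble the CSV download URL from a ticker symbol
# and a date range. Assumes the query parameters seen in the link above.
build_download_url <- function(symbol, from, to, interval = "1d") {
  # Convert a date to seconds since 1970-01-01 UTC (midnight of that day).
  to_epoch <- function(d) sprintf("%.0f", as.numeric(as.Date(d)) * 86400)

  paste0(
    "https://query1.finance.yahoo.com/v7/finance/download/", symbol,
    "?period1=", to_epoch(from),
    "&period2=", to_epoch(to),
    "&interval=", interval,
    "&events=history&includeAdjustedClose=true"
  )
}

url <- build_download_url("BTC-USD", "2016-11-30", "2021-11-30")
print(url)

# Reading the data needs network access and may be rejected by the server:
# btc <- read.csv(url)
```

For the dates 2016-11-30 and 2021-11-30 this reproduces the download link shown earlier in this answer, so the same function can be reused for other tickers or periods.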

r

web-scraping

rvest

tidyverse