Error message when using read_xml to scrape data

Question

As a newcomer to web scraping in R, I would appreciate some help with a web-scraping project. I want to scrape the data that generates the chart on this page.

I inspected the page in Chrome and determined that this link is the one that returns the data.

[Screenshot: website inspection]

Using this URL, I wrote the following code to parse the data:

library(xml2)  # provides read_xml()

url <- 'https://www.solactive.com/Indices/?indexhistory=DE000SL0BBT0&indexhistorytype=max'
index_data <- read_xml(url)

Unfortunately, I got this error message:

Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Failed to parse text

I looked at the headers for the page, which are as follows.

Response Headers

content-encoding: gzip
content-length: 20624
content-type: text/html; charset=UTF-8
date: Thu, 21 Apr 2022 00:33:05 GMT
server: nginx
strict-transport-security: max-age=63072000
vary: Accept-Encoding

Accept Headers (snapshot)

accept: application/json, text/javascript, */*; q=0.01
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9

I also tried passing the following encoding, without success:

index_data <- read_xml(url, encoding = "gzip, deflate, br")

What I want is a data table with index_id, date, value.

Any help would be greatly appreciated.

Thanks

Answer 1

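I'm not sure why, in R, the response is still HTML despite setting various headers, whereas in Python passing only the referer header is enough to get JSON back. However, as a slightly clunky workaround, you can extract the text from the p tag in the response and parse it with jsonlite: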

library(httr2)
library(rvest)

# Referer header sent with the request
headers <- list(referer = 'https://www.solactive.com/Indices/?index=DE000SL0BBT0')

# Query string parameters identifying the index and the history range
params <- list(indexhistory = 'DE000SL0BBT0', indexhistorytype = 'max')

data <- request("https://www.solactive.com/Indices/") |>
  req_headers(!!!headers) |>
  req_url_query(!!!params) |>
  req_perform() |>
  resp_body_html() |>
  html_element('p') |>   # the JSON text ends up inside a <p> tag
  html_text() |>
  jsonlite::parse_json(simplifyVector = TRUE)
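
To get from the parsed result to the index_id, date, value table asked for in the question, something like the following might work. This is only a sketch: I have not confirmed the field names in the JSON, so the columns indexId, timestamp and value, and the idea that timestamp is in epoch milliseconds, are assumptions. The small data frame below is a made-up stand-in for `data`, so that the snippet runs without hitting the site.

```r
# Hypothetical stand-in for the parsed result `data`. The column names
# (indexId, timestamp, value) and the sample rows are assumptions.
data <- data.frame(
  indexId   = c("DE000SL0BBT0", "DE000SL0BBT0"),
  timestamp = c(1650412800000, 1650499200000),  # assumed epoch milliseconds
  value     = c(100.5, 101.25)
)

index_table <- data.frame(
  index_id = data$indexId,
  # Convert milliseconds since 1970-01-01 UTC into a Date
  date     = as.Date(as.POSIXct(data$timestamp / 1000,
                                origin = "1970-01-01", tz = "UTC")),
  value    = data$value
)

print(index_table)
```

If the real response uses different names, run names(data) first and adjust the three column references accordingly.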

Tags: encoding, r, web-scraping, rvest