How to scrape a large table from a PHP website using R

Question

I am trying to scrape the table from 'https://www.metabolomicsworkbench.org/data/mb_structure_ajax.php'.

The code I found online (using rvest) does not work:

library(rvest)
url <- "https://www.metabolomicsworkbench.org/data/mb_structure_ajax.php"
A <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="containerx"]/div[1]/table') %>%
  html_table()

A comes back as a 'list of 0'.

How can I fix this code, or is there a better approach?

Thanks in advance.

Answer 1

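The page content is generated by JavaScript, so the table is not in the HTML that read_html() downloads. Here is what to do: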

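Open your browser's developer tools and go to the Network tab.
Click one of the page links and watch what happens (I clicked page 4). The page sends a POST request to https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php and receives the table content in the response. The request's parameters are listed in the same Network entry.
Mimic that POST request with rvest. Here is code that scrapes all pages: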

library(rvest)

url <- "https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php"
pg <- html_session(url)
data <- 
  purrr::map_dfr(
    1:4288, # try a small range first, or scrape in batches and combine the data frames later, in case a request fails midway
    function(i) {
      pg <- rvest:::request_POST(pg,
                                 url,
                                 body = list(
                                   page = i
                                 ))
      read_html(pg) %>%
        html_node("table") %>%
        html_table() 
    }
  )
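The code above relies on rvest:::request_POST, an unexported internal function that may not be available in every rvest version. As a hedged alternative, the same POST request can be sketched with httr directly. The helper names (parse_table, fetch_page, scrape_pages), the one-second pause, and the form encoding are assumptions for illustration, not part of the original answer; only the URL and the page parameter come from it.

```r
library(rvest)

# Endpoint taken from the answer above.
url <- "https://www.metabolomicsworkbench.org/data/mb_structure_tableonly.php"

# Hypothetical helper: parse the first <table> found in an HTML string.
parse_table <- function(html_text) {
  read_html(html_text) %>%
    html_node("table") %>%
    html_table()
}

# Hypothetical helper: POST one page number and return its table,
# or NULL (with a message) if the request or the parsing fails.
fetch_page <- function(i, pause = 1) {
  Sys.sleep(pause)  # assumed delay between requests, to go easy on the server
  tryCatch({
    # Assumption: the server accepts the page number as a form field.
    resp <- httr::POST(url, body = list(page = i), encode = "form")
    httr::stop_for_status(resp)
    parse_table(httr::content(resp, as = "text", encoding = "UTF-8"))
  }, error = function(e) {
    message("page ", i, " failed: ", conditionMessage(e))
    NULL
  })
}

# Hypothetical helper: scrape a range of pages and row-bind the successes.
scrape_pages <- function(pages, pause = 1) {
  tables <- lapply(pages, fetch_page, pause = pause)
  dplyr::bind_rows(Filter(Negate(is.null), tables))
}
```

Calling scrape_pages(1:5) first is a cheap way to inspect the result before committing to all 4288 pages; failed pages are reported by message and can be retried in a later batch.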
