How to retrieve multiple tables from a webpage using R

I want to use R to extract all of the vaccine tables, along with the descriptions on the left and the descriptions inside each table.

Here is the link to the webpage.

This is what the first table looks like on the webpage:

I tried using the XML package but had no success. I used:

vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)

I get this error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: '' 

How can I do this?

This webpage does not use HTML tables, which is why your call fails. The formatting on the page is quite complicated, with multiple subsections and hidden text, so the nodes of interest need to be located individually.
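
You can verify this directly; a quick check (a sketch, reading the URL from your question with rvest):

library(rvest)

page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")
#expect 0: the page has no <table> markup, so readHTMLTable() has nothing to parse
length(html_nodes(page, "table"))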

I prefer to use the "rvest" and "xml2" packages for a simpler, more direct syntax.
This is not a complete solution, but it should get you moving in the right direction.

library(rvest)
library(xml2)
library(dplyr)

#read the page into R (URL from the question)
page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")

#find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()

#find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#pull the company, product, phase, and expanded details from each row
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()


#get the vaccine type (the category heading for each section)
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>% 
   html_node('div.is_h3') %>% html_text()
#determine the number of vaccines in each category
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length()
#repeat each category name once per vaccine so the vector matches the rows
VaccineType <- rep(vaccinetypes, times = lengthvector)

answer <- data.frame(VaccineType,  company, product, phase)
head(answer)
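
The misc text collected above holds the expanded row details but is not included in the data frame; a minimal follow-up sketch, assuming it aligns row-for-row with the other vectors (the column name is just an example):

#attach the expanded description text as an extra column
answer$details <- misc
head(answer)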

To produce this code, you need to read the page's HTML source and identify the correct nodes and unique attributes for the information of interest.
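
For example, one quick way to survey which class attributes appear inside the vaccine section (a sketch reusing the parentvaccine node from the code above):

#list the distinct class attributes under the vaccine section to help
#spot which nodes carry the information of interest
parentvaccine %>%
  html_nodes(xpath = ".//div[@class]") %>%
  html_attr("class") %>%
  unique()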