How to retrieve multiple tables from a webpage using R

I want to use R to extract all of the vaccine tables, along with the descriptions on the left and the descriptions inside each table.

Here is the link to the webpage.

This is what the first table looks like on the webpage:

I tried using the XML package but had no success. I used:

vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)

I get this error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: '' 

How can I do this?

This webpage does not use HTML tables, which is why your call fails. The formatting on the page is quite complicated, with multiple subsections and hidden text, so the nodes of interest need to be located individually.
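
You can verify this directly; a quick check (a sketch, reading the URL from your question with rvest):

library(rvest)

page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")
#expect 0: the page has no <table> markup, so readHTMLTable() has nothing to parse
length(html_nodes(page, "table"))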

I prefer to use the "rvest" and "xml2" packages for a simpler, more direct syntax.
This is not a complete solution, but it should get you moving in the right direction.

library(rvest)
library(xml2)
library(dplyr)

#read the page into R (URL from the question)
page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")

#find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()

#find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#pull the company, product, phase, and expanded details from each row
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()


#get the vaccine type (the category heading for each section)
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>% 
   html_node('div.is_h3') %>% html_text()
#determine the number of vaccines in each category
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length()
#repeat each category name once per vaccine so the vector matches the rows
VaccineType <- rep(vaccinetypes, times = lengthvector)

answer <- data.frame(VaccineType,  company, product, phase)
head(answer)
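
The misc text collected above holds the expanded row details but is not included in the data frame; a minimal follow-up sketch, assuming it aligns row-for-row with the other vectors (the column name is just an example):

#attach the expanded description text as an extra column
answer$details <- misc
head(answer)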

To produce this code, you need to read the page's HTML source and identify the correct nodes and unique attributes for the information of interest.
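
For example, one quick way to survey which class attributes appear inside the vaccine section (a sketch reusing the parentvaccine node from the code above):

#list the distinct class attributes under the vaccine section to help
#spot which nodes carry the information of interest
parentvaccine %>%
  html_nodes(xpath = ".//div[@class]") %>%
  html_attr("class") %>%
  unique()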