Webscraping: find the node/table ID in R using inspect

Question

I am working on a web scraping exercise and want to retrieve the table below from the following URL:

https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory

COVID-19 cases, deaths, and rates by location, updated April 7, 2022 [5]

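I right-clicked the page in the browser, chose Inspect, and hoped to find the table ID/node that should replace the "??" in the code below. I could not find this node.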

library(tidyverse)
library(rvest)

# get the data 

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"

html_data <- read_html(url)

html_data %>%
  html_node("??") %>% # how do I get the node containing the table
  html_table() %>% 
  as_tibble()

Thanks.

Answer 1

Use the browser's developer tools to copy the table's XPath, and use it in place of "??".

suppressPackageStartupMessages({
  library(httr)
  library(rvest)
  library(dplyr)
})

url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
xp <- "/html/body/div[3]/div[3]/div[5]/div[1]/div[15]/div[5]/table"

html_data <- read_html(url)

html_data %>%
  html_elements(xpath = xp) %>% # select the table node by its XPath
  html_table() %>%
  .[[1]] %>%
  select(-1)
#> # A tibble: 218 x 4
#>    Country                `Deaths / million` Deaths    Cases      
#>    <chr>                  <chr>              <chr>     <chr>      
#>  1 World[a]               783                6,166,510 495,130,920
#>  2 Peru                   6,366              212,396   3,549,511  
#>  3 Bulgaria               5,314              36,655    1,143,424  
#>  4 Bosnia and Herzegovina 4,819              15,728    375,948    
#>  5 Hungary                4,738              45,647    1,863,039  
#>  6 North Macedonia        4,433              9,234     307,142    
#>  7 Montenegro             4,308              2,706     233,523    
#>  8 Georgia                4,212              16,765    1,650,384  
#>  9 Croatia                3,833              15,646    1,105,315  
#> 10 Czech Republic         3,712              39,816    3,850,902  
#> # ... with 208 more rows

Created on 2022-04-08 by the reprex package (v2.0.1)
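
Note that every column in the output above is `<chr>`, because the figures contain thousands separators. A minimal sketch of converting them to numbers follows, assuming the readr package is available; the tibble is a hand-copied sample of the first three rows of the output, and the names `covid` and `covid_num` are illustrative only.

```r
library(dplyr)
library(readr)

# Sample rows copied from the output above; all columns are character
covid <- tibble(
  Country = c("World[a]", "Peru", "Bulgaria"),
  `Deaths / million` = c("783", "6,366", "5,314"),
  Deaths = c("6,166,510", "212,396", "36,655"),
  Cases = c("495,130,920", "3,549,511", "1,143,424")
)

covid_num <- covid %>%
  mutate(
    # parse_number() skips the "," grouping mark under the default locale
    across(c(`Deaths / million`, Deaths, Cases), parse_number),
    # drop trailing footnote markers such as "[a]" from the country names
    Country = sub("\\[[a-z]+\\]$", "", Country)
  )

covid_num
```

The same `mutate()` step could be appended to the pipeline after `select(-1)`, provided the column names on the live page still match those shown in the output.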

Answer 2

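I would use a CSS selector list, which is more stable, faster, and more descriptive than a long and fragile XPath. There is a specific parent ID (generally the fastest way to match) and child table class (second fastest) combination you can use: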

library(magrittr)
library(rvest)

df <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory') %>%
  html_element('#covid-19-cases-deaths-and-rates-by-location .wikitable') %>%
  html_table()
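
To see why the ID/class selector is less fragile than a positional XPath, here is a small offline sketch. The HTML fragment is a hypothetical stand-in for the Wikipedia page (only the id and class are taken from the selector above), built with rvest's `minimal_html()` so that nothing is fetched from the network.

```r
library(rvest)

# Hypothetical stand-in for the page: a banner table followed by the target table
page <- minimal_html('
  <div class="banner">
    <table><tr><th>Notice</th></tr><tr><td>ignore me</td></tr></table>
  </div>
  <div id="covid-19-cases-deaths-and-rates-by-location">
    <table class="wikitable">
      <tr><th>Country</th><th>Deaths</th></tr>
      <tr><td>Peru</td><td>212,396</td></tr>
      <tr><td>Bulgaria</td><td>36,655</td></tr>
    </table>
  </div>
')

# CSS: match by parent id and child class, independent of position in the page
by_css <- page %>%
  html_element("#covid-19-cases-deaths-and-rates-by-location .wikitable") %>%
  html_table()

# Positional XPath: only correct while the target stays the second <div> in <body>
by_xpath <- page %>%
  html_element(xpath = "/html/body/div[2]/table") %>%
  html_table()

by_css
```

Both calls return the same table for this fragment, but adding or removing any `<div>` ahead of the target would shift `div[2]` and make the XPath select the wrong node or none at all, while the CSS selector would keep matching as long as the id and class stay the same.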
