rvest: for loop/map to pull multiple tables using html_node & html_table

Question

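I am trying to programmatically pull all of the box scores for a given day from NBA Reference (I used January 4th, 2020, which has multiple games). I started by creating a list of integers to denote the number of box scores to pull: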

games<- c(1:3)

Then I used the developer tools in my browser to determine what each table contains (you can also use selector gadget):

#content > div.game_summaries > div:nth-child(1) > table.teams

Then I used purrr::map to create a list of the tables to pull, using games:

map_list<- map(.x= '', paste, '#content > div.game_summaries > div:nth-child(', games, ') > table.teams', 
           sep = "") 
# check map_list
map_list

Then I tried to run this list through a for loop to generate three tables, using tidyverse and rvest, which produced an error:

for (i in map_list){
read_html('https://www.basketball-reference.com/boxscores/') %>% 
  html_node(map_list[[1]][i]) %>% 
  html_table() %>% 
  glimpse()
}

Error in selectr::css_to_xpath(css, prefix = ".//") : 
  Zero length character vector found for the following argument: selector
In addition: Warning message:
In selectr::css_to_xpath(css, prefix = ".//") :
  NA values were found in the 'selector' argument, they have been removed

For reference, if I spell out the selector explicitly, or call the exact item from map_list, the code works as expected (run the items below for reference):

read_html('https://www.basketball-reference.com/boxscores/') %>% 
  html_node('#content > div.game_summaries > div:nth-child(1) > table.teams') %>% 
  html_table() %>% 
  glimpse()

read_html('https://www.basketball-reference.com/boxscores/') %>% 
  html_node(map_list[[1]][1]) %>% 
  html_table() %>% 
  glimpse()

How do I make this work with a list? I have looked at other threads, but even though they use the same site, they are not the same issue.

Answer 1

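With your current map_list, for (i in map_list) hands the loop the selectors themselves rather than their positions, so map_list[[1]][i] is indexed by a character vector and returns NA. If you want to use a for loop, iterate over the positions instead: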

library(rvest)

for (i in seq_along(map_list[[1]])){
  read_html('https://www.basketball-reference.com/boxscores/') %>% 
   html_node(map_list[[1]][i]) %>% 
   html_table() %>% 
   glimpse()
}
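
To see why the original loop fails without touching the network, here is a minimal base R sketch. It assumes map_list has the same shape as in the question (a list of length 1 holding a character vector of three selectors). The names by_value and by_position are made up for illustration.

```r
# Base R only; the selector text is copied from the question.
games <- 1:3

# Same shape as the question's map_list: a list of length 1 whose single
# element is a character vector holding the three selectors.
map_list <- list(paste0("#content > div.game_summaries > div:nth-child(",
                        games, ") > table.teams"))

# `for (i in map_list)` walks the list's elements, so the body runs once and
# `i` is the whole character vector of selectors, not a position.
for (i in map_list) {
  by_value <- map_list[[1]][i]  # character subscripts on an unnamed vector
}
print(by_value)  # every entry is NA, which html_node() cannot use as a selector

# seq_along() yields the positions 1, 2, 3, so each lookup returns one selector.
by_position <- character(0)
for (i in seq_along(map_list[[1]])) {
  by_position[i] <- map_list[[1]][i]
}
print(by_position)
```

The NA values in by_value match the warning in the question: selectr drops them, and the selector that remains is a zero-length character vector.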

However, I think the following is simpler. You do not need map to create map_list, because paste is vectorized:

map_list<- paste0('#content > div.game_summaries > div:nth-child(', games, ') > table.teams')
url <- 'https://www.basketball-reference.com/boxscores/'
webpage <- url %>% read_html()

purrr::map(map_list, ~webpage %>% html_node(.x) %>% html_table)

#[[1]]
#       X1  X2    X3
#1 Indiana 111 Final
#2 Atlanta 116      

#[[2]]
#        X1  X2    X3
#1  Toronto 121 Final
#2 Brooklyn 102      

#[[3]]
#       X1  X2    X3
#1  Boston 111 Final
#2 Chicago 104
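
If a single data frame is more convenient than a list, the mapped tables can be stacked afterwards. The sketch below is one possible way, not part of the original answer. It runs against a small inline HTML string (toy_html, a hypothetical stand-in for the live page) so it works offline, and the game_1, game_2 labels and the use of dplyr::bind_rows are assumptions made for illustration.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical stand-in for the box score page, so the sketch runs offline.
toy_html <- '<html><body><div id="content"><div class="game_summaries">
  <div class="game_summary"><table class="teams">
    <tr><td>Indiana</td><td>111</td><td>Final</td></tr>
    <tr><td>Atlanta</td><td>116</td><td></td></tr>
  </table></div>
  <div class="game_summary"><table class="teams">
    <tr><td>Toronto</td><td>121</td><td>Final</td></tr>
    <tr><td>Brooklyn</td><td>102</td><td></td></tr>
  </table></div>
</div></div></body></html>'

webpage <- read_html(toy_html)

# paste0 is vectorized, so this builds one selector per game position.
selectors <- paste0("#content > div.game_summaries > div:nth-child(",
                    1:2, ") > table.teams")

# One table per selector, exactly as in the answer above.
tables <- map(selectors, ~ webpage %>% html_node(.x) %>% html_table())

# Label each table (made-up labels), then stack them with the label as a column.
names(tables) <- paste0("game_", seq_along(tables))
scores <- bind_rows(tables, .id = "game")
print(scores)
```

Parsing the page once and reusing webpage, as the answer does, also avoids requesting the same URL once per game.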

Answer 2

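This page is reasonably easy to scrape. Here is a possible solution: first scrape the game summary nodes, the div elements with class "game_summary". This provides a list of all of the games played. It also allows the use of the html_node function, which guarantees a return for every game and therefore keeps the list sizes equal.

Each game summary is made up of three sub-tables. The first and third tables can be scraped directly. The second table does not have a class assigned, which makes it trickier to retrieve.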

library(rvest)

page <- read_html('https://www.basketball-reference.com/boxscores/')

# find all of the game summaries on the page
games <- page %>% html_nodes("div.game_summary")

# Each game summary has 3 sub tables:
# the game score is table 1, of class "teams"
# the stats are table 3, of class "stats"
# the quarterly score is the second table and does not have a class defined
table1 <- games %>% html_node("table.teams") %>% html_table()
stats <- games %>% html_node("table.stats") %>% html_table()
quarter <- sapply(games, function(g) {
  g %>% html_nodes("table") %>% .[2] %>% html_table()
})
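
The point about html_node guaranteeing a return can be checked on a small example. This is a sketch under stated assumptions: toy_html is a made-up page with two game summaries, the second of which deliberately lacks a stats table, and first_match and all_matches are illustrative names. It also shows one way to pick the class-less second table by position with [[2]], a variation on the sapply call above.

```r
library(rvest)

# Hypothetical page: two game summaries, the second one has no stats table.
toy_html <- '<html><body>
<div class="game_summary">
  <table class="teams"><tr><td>Indiana</td><td>111</td></tr>
                       <tr><td>Atlanta</td><td>116</td></tr></table>
  <table><tr><td>Indiana</td><td>25</td><td>30</td></tr>
         <tr><td>Atlanta</td><td>28</td><td>31</td></tr></table>
  <table class="stats"><tr><td>PTS</td><td>Player A</td></tr></table>
</div>
<div class="game_summary">
  <table class="teams"><tr><td>Toronto</td><td>121</td></tr>
                       <tr><td>Brooklyn</td><td>102</td></tr></table>
  <table><tr><td>Toronto</td><td>33</td><td>29</td></tr>
         <tr><td>Brooklyn</td><td>24</td><td>27</td></tr></table>
</div>
</body></html>'

games <- read_html(toy_html) %>% html_nodes("div.game_summary")

# html_node returns one result per game, with a placeholder where nothing matched.
first_match <- games %>% html_node("table.stats")
length(first_match)

# html_nodes returns only the matches, so the link back to each game is lost.
all_matches <- games %>% html_nodes("table.stats")
length(all_matches)

# The class-less table is picked by position: the second table in each summary.
quarter <- lapply(games, function(g) {
  g %>% html_nodes("table") %>% .[[2]] %>% html_table()
})
```

Here first_match has one entry per game while all_matches has only one entry in total, which is why html_node is the safer choice when the results need to line up with the list of games.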
