rvest: follow different links with same tag

Question

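I'm doing a small project in R that involves scraping some football data from a website. Here is the link to one of the years of data:

http://www.sports-reference.com/cfb/years/2007-schedule.html

As you can see, there is a "Date" column in which each date is hyperlinked, and the hyperlink takes you to the stats for that particular game, which is the data I want to scrape. Unfortunately, many games take place on the same date, which means their hyperlinks have the same text. So if I scrape the hyperlinks from the table (which I have done) and then do something like this: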

url <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
for (i in seq_along(links)) {
  stats <- html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}

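it scrapes from the first link that matches each date. For example, there were 11 games on Aug 30, 2007, but if I pass that date to follow_link, it grabs the data from the first game (Boise St. Weber St.) every time. Is there any way to specify that I want it to move down the table?

I have already found a workaround by figuring out the formula for the URLs that the date hyperlinks point to, but it is a fairly convoluted process, so I thought I would see whether anyone knows how to do it this way.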
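The behaviour described above can be reproduced offline. The sketch below is only an illustration in base R, not rvest itself: the `games` table and its `href` values are made-up placeholders, and `match()` stands in for any lookup that returns the first link whose label matches.

```r
# Hypothetical link table: three games share one date label.
games <- data.frame(
  label = c("Aug 30, 2007", "Aug 30, 2007", "Aug 30, 2007", "Aug 31, 2007"),
  href  = c("/game-a.html", "/game-b.html", "/game-c.html", "/game-d.html"),
  stringsAsFactors = FALSE
)

# Looking a link up by its label always lands on the first match ...
by_label <- vapply(games$label, function(lbl) {
  games$href[match(lbl, games$label)]
}, character(1), USE.NAMES = FALSE)

# ... whereas walking the href column by position reaches every game.
by_position <- games$href[seq_len(nrow(games))]

print(by_label)
print(by_position)
```

Under these assumptions, the label-based lookup keeps returning the first game for the shared date, while the position-based one returns each href once. This is why working from the `href` attributes tends to be more reliable than working from the link text.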

Answer 1

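You are looping, but you assign to the same variable on every iteration, so each result overwrites the previous one. Try this: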

url <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links: character vector with the scraped date links
stats <- vector("list", length(links))  # one slot per link
for (i in seq_along(links)) {
  stats[[i]] <- html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}
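The pattern of keeping one result per iteration can be tried without hitting the site. In this minimal sketch, `fetch_stats()` is a hypothetical stand-in for the scraping pipeline and the `links` values are placeholders; only the list handling is the point.

```r
# Hypothetical stand-in for the html_session() %>% ... %>% html_table()
# pipeline; it returns a small data frame so the loop can run offline.
fetch_stats <- function(link) {
  data.frame(link = link, n_chars = nchar(link), stringsAsFactors = FALSE)
}

links <- c("/game-a.html", "/game-b.html", "/game-c.html")  # placeholder values

# Pre-allocate one slot per link, then fill each slot with [[i]]
stats <- vector("list", length(links))
for (i in seq_along(links)) {
  stats[[i]] <- fetch_stats(links[i])
}

# Equivalent and shorter: lapply() builds the list for you
stats2 <- lapply(links, fetch_stats)
```

Using `[[i]]` rather than `[i]` matters because each result is a whole table, and a list slot can hold any object.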

Answer 2

Here is a complete example:

library(rvest)
library(dplyr)
library(pbapply)

# Get the main page

URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)  # html() is deprecated in newer rvest releases

# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")

# I'm limiting this to the first 10 links since I really don't care about
# football enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs.
# pblapply gives you a much-needed progress bar for free.

scoring_games <- pblapply(links[1:10], function(x) {

  game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))
  scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
  colnames(scoring) <- scoring[1,]
  filter(scoring[-1,], !Player %in% c("", "Player"))

})

# you can bind_rows them all together but you should 
# probably add a column for the game then

bind_rows(scoring_games)

## Source: local data frame [27 x 11]
## 
##             Player            School   Cmp   Att   Pct   Yds   Y/A  AY/A    TD   Int  Rate
##              (chr)             (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1     Taylor Tharp       Boise State    14    19  73.7   184   9.7  10.7     1     0 172.4
## 2       Nick Lomax       Boise State     1     5  20.0     5   1.0   1.0     0     0  28.4
## 3    Ricky Cookman       Boise State     1     2  50.0     9   4.5 -18.0     0     1 -12.2
## 4         Ben Mauk        Cincinnati    18    27  66.7   244   9.0   8.9     2     1 159.6
## 5        Tony Pike        Cincinnati     6     9  66.7    57   6.3   8.6     1     0 156.5
## 6   Julian Edelman        Kent State    17    26  65.4   161   6.2   3.5     1     2 114.7
## 7       Bret Meyer        Iowa State    14    23  60.9   148   6.4   3.4     1     2 111.9
## 8       Matt Flynn   Louisiana State    12    19  63.2   128   6.7   8.8     2     0 154.5
## 9  Ryan Perrilloux   Louisiana State     2     3  66.7    21   7.0  13.7     1     0 235.5
## 10   Michael Henig Mississippi State    11    28  39.3   120   4.3  -5.4     0     6  32.4
## ..             ...               ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
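The comment in the example suggests adding a column that identifies the game before binding the rows. A minimal offline sketch of that step follows, in base R. The player names, values, and hrefs are placeholders rather than scraped data, and the column name `game` is an assumption.

```r
# Placeholder per-game tables standing in for the scraped results
scoring_games <- list(
  data.frame(Player = c("Player A", "Player B"), Yds = c("184", "5"),
             stringsAsFactors = FALSE),
  data.frame(Player = "Player C", Yds = "244", stringsAsFactors = FALSE)
)
links <- c("/game-a.html", "/game-b.html")  # placeholder hrefs, one per table

# Tag every table with the link it came from, then stack them
tagged <- Map(function(tbl, link) {
  tbl$game <- link
  tbl
}, scoring_games, links)

all_games <- do.call(rbind, tagged)
rownames(all_games) <- NULL
print(all_games)
```

With dplyr loaded, naming the list by link and calling `bind_rows(scoring_games, .id = "game")` should give a similar result in one step.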
