rvest:关注具有相同标签的不同链接
rvest: follow different links with same tag
我正在用 R 做一个小项目,涉及从网站上抓取一些足球数据。这是 link 到其中一年的数据:
http://www.sports-reference.com/cfb/years/2007-schedule.html。
如您所见,有一个 "Date" 列,日期为 hyperlinked,此 hyperlink 将带您查看该特定游戏的统计数据,即我想抓取的数据。不幸的是,很多游戏都在同一天举行,这意味着它们的 hyperlink 是相同的。因此,如果我从 table(我已经完成)中抓取 hyperlinks,然后执行类似的操作:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
stats = html_session(url) %>%
follow_link(link[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
它将从每个日期对应的第一个 link 开始抓取。例如,2007 年 8 月 30 日有 11 场比赛,但如果我将其放入 follow_link 函数中,它每次都会从第一场比赛(Boise St. Weber St.)中获取数据。有什么方法可以指定我希望它向下移动 table?
我已经通过找出日期 hyperlinks 将您带到的网址的公式找到了解决方法,但这是一个非常复杂的过程,所以我想我会看看是否有人知道如何做到这一点。
你正在循环,但每次都设置同一个变量,试试这个:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
stats[i] = html_session(url) %>%
follow_link(link[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
这是一个完整的例子:
library(rvest)
library(dplyr)
library(pbapply)
# Get the main page
URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- html(URL)
# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")
# I'm only limiting to 10 since I rly don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free
scoring_games <- pblapply(links[1:10], function(x) {
game_pg <- html(sprintf("http://www.sports-reference.com%s", x))
scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
colnames(scoring) <- scoring[1,]
filter(scoring[-1,], !Player %in% c("", "Player"))
})
# you can bind_rows them all together but you should
# probably add a column for the game then
bind_rows(scoring_games)
## Source: local data frame [27 x 11]
##
## Player School Cmp Att Pct Yds Y/A AY/A TD Int Rate
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Taylor Tharp Boise State 14 19 73.7 184 9.7 10.7 1 0 172.4
## 2 Nick Lomax Boise State 1 5 20.0 5 1.0 1.0 0 0 28.4
## 3 Ricky Cookman Boise State 1 2 50.0 9 4.5 -18.0 0 1 -12.2
## 4 Ben Mauk Cincinnati 18 27 66.7 244 9.0 8.9 2 1 159.6
## 5 Tony Pike Cincinnati 6 9 66.7 57 6.3 8.6 1 0 156.5
## 6 Julian Edelman Kent State 17 26 65.4 161 6.2 3.5 1 2 114.7
## 7 Bret Meyer Iowa State 14 23 60.9 148 6.4 3.4 1 2 111.9
## 8 Matt Flynn Louisiana State 12 19 63.2 128 6.7 8.8 2 0 154.5
## 9 Ryan Perrilloux Louisiana State 2 3 66.7 21 7.0 13.7 1 0 235.5
## 10 Michael Henig Mississippi State 11 28 39.3 120 4.3 -5.4 0 6 32.4
## .. ... ... ... ... ... ... ... ... ... ... ...
我正在用 R 做一个小项目,涉及从网站上抓取一些足球数据。这是 link 到其中一年的数据:
http://www.sports-reference.com/cfb/years/2007-schedule.html。
如您所见,有一个 "Date" 列,日期为 hyperlinked,此 hyperlink 将带您查看该特定游戏的统计数据,即我想抓取的数据。不幸的是,很多游戏都在同一天举行,这意味着它们的 hyperlink 是相同的。因此,如果我从 table(我已经完成)中抓取 hyperlinks,然后执行类似的操作:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
stats = html_session(url) %>%
follow_link(link[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
它将从每个日期对应的第一个 link 开始抓取。例如,2007 年 8 月 30 日有 11 场比赛,但如果我将其放入 follow_link 函数中,它每次都会从第一场比赛(Boise St. Weber St.)中获取数据。有什么方法可以指定我希望它向下移动 table?
我已经通过找出日期 hyperlinks 将您带到的网址的公式找到了解决方法,但这是一个非常复杂的过程,所以我想我会看看是否有人知道如何做到这一点。
你正在循环,但每次都设置同一个变量,试试这个:
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
stats[i] = html_session(url) %>%
follow_link(link[i]) %>%
html_nodes('whateverthisnodeis') %>%
html_table()
}
这是一个完整的例子:
library(rvest)
library(dplyr)
library(pbapply)
# Get the main page
URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- html(URL)
# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")
# I'm only limiting to 10 since I rly don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free
scoring_games <- pblapply(links[1:10], function(x) {
game_pg <- html(sprintf("http://www.sports-reference.com%s", x))
scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
colnames(scoring) <- scoring[1,]
filter(scoring[-1,], !Player %in% c("", "Player"))
})
# you can bind_rows them all together but you should
# probably add a column for the game then
bind_rows(scoring_games)
## Source: local data frame [27 x 11]
##
## Player School Cmp Att Pct Yds Y/A AY/A TD Int Rate
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Taylor Tharp Boise State 14 19 73.7 184 9.7 10.7 1 0 172.4
## 2 Nick Lomax Boise State 1 5 20.0 5 1.0 1.0 0 0 28.4
## 3 Ricky Cookman Boise State 1 2 50.0 9 4.5 -18.0 0 1 -12.2
## 4 Ben Mauk Cincinnati 18 27 66.7 244 9.0 8.9 2 1 159.6
## 5 Tony Pike Cincinnati 6 9 66.7 57 6.3 8.6 1 0 156.5
## 6 Julian Edelman Kent State 17 26 65.4 161 6.2 3.5 1 2 114.7
## 7 Bret Meyer Iowa State 14 23 60.9 148 6.4 3.4 1 2 111.9
## 8 Matt Flynn Louisiana State 12 19 63.2 128 6.7 8.8 2 0 154.5
## 9 Ryan Perrilloux Louisiana State 2 3 66.7 21 7.0 13.7 1 0 235.5
## 10 Michael Henig Mississippi State 11 28 39.3 120 4.3 -5.4 0 6 32.4
## .. ... ... ... ... ... ... ... ... ... ... ...