Error/exception handling with bind_rows() and lapply() functions
I have a function that scrapes a table from each URL in a list:
getscore <- function(www0) {
  require(rvest)
  require(dplyr)
  www <- html(www0)
  boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
  names(boxscore)[3] <- "VG"
  names(boxscore)[5] <- "HG"
  names(boxscore)[6] <- "Type"
  return(boxscore)
}
Working example data:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
However, URLs for dates on which no games were played break my function:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
How can I build error/exception handling into my function so that it skips the broken URLs?
The code should be reproducible...
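The table you get when no games were played has a completely different structure. You can check whether colnames(boxscore) is what you expect. As an example, here is an adaptation of your function that checks whether the column Visitor is present.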
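getscore <- function(www0) {
  require(rvest)
  require(dplyr)
  # html() is the call used in the question; newer rvest releases use read_html()
  www <- html(www0)
  boxscore <- www %>% html_table(fill = TRUE) %>% .[[1]]
  # Only rename and return the table when the expected game columns are present
  if ("Visitor" %in% colnames(boxscore)) {
    names(boxscore)[3] <- "VG"
    names(boxscore)[5] <- "HG"
    names(boxscore)[6] <- "Type"
    return(boxscore)
  }
  # No games on that date: return NULL so the page contributes no rows
  NULL
}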
With this function, your example no longer breaks:
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/23/",
"http://www.hockey-reference.com/boxscores/2014/12/24/",
"http://www.hockey-reference.com/boxscores/2014/12/25/")
nhl14_15 <- bind_rows(lapply(www_list, getscore))
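The column check above handles pages that load but contain the wrong table. If a URL fails outright (for example, the request itself errors), a tryCatch() wrapper is one way to skip it. The following is only a sketch under stated assumptions: fake_getscore() and skip_on_error() are hypothetical names introduced here, the example.com URLs are placeholders, and fake_getscore() merely simulates a failing page so the code runs without network access. Base R's rbind() is used to keep the sketch free of package dependencies.

```r
# Hypothetical stand-in for getscore(): it raises an error for one "URL"
# so the sketch can run without touching the network.
fake_getscore <- function(www0) {
  if (grepl("12/25", www0, fixed = TRUE)) stop("no table found at ", www0)
  data.frame(Date = www0, VG = 1L, HG = 2L, stringsAsFactors = FALSE)
}

# Wrap a scraper so that an error yields NULL instead of aborting lapply().
skip_on_error <- function(f) {
  function(www0) {
    tryCatch(
      f(www0),
      error = function(e) {
        message("skipping ", www0, ": ", conditionMessage(e))
        NULL
      }
    )
  }
}

# Placeholder URLs, not real pages
urls <- c("http://example.com/2014/12/22/",
          "http://example.com/2014/12/25/")

results <- lapply(urls, skip_on_error(fake_getscore))
results <- Filter(Negate(is.null), results)  # drop the skipped pages
scores  <- do.call(rbind, results)           # bind what is left
```

In real use you would presumably pass skip_on_error(getscore) to lapply() instead of the stand-in.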
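A nice approach here is to use rbindlist from the data.table package (which lets you use fill=TRUE), so you can bind everything, including the tables that bind_rows cannot handle. You can then keep only the rows with a non-NA Date (the NA rows come from the pages where bind_rows fails) and restrict the result to the first 6 columns, which I guess hold the valid data you are looking for.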
library(data.table) # development version 1.9.5
www_list <- c("http://www.hockey-reference.com/boxscores/2014/12/20/",
"http://www.hockey-reference.com/boxscores/2014/12/21/",
"http://www.hockey-reference.com/boxscores/2014/12/22/",
"http://www.hockey-reference.com/boxscores/2014/12/24/") # not working
resdt <- rbindlist(
  lapply(www_list, function(www0) {
    message("web is ", www0) # comment this line out to silence the progress message
    getscore(www0)
  }),
  fill = TRUE)
resdt[!is.na(Date), 1:6, with = FALSE] # the first 6 columns hold the valid data
Date Visitor VG Home HG Type
1: 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2: 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3: 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4: 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5: 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6: 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7: 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8: 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9: 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10: 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11: 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12: 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13: 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14: 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15: 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16: 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17: 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18: 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19: 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20: 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21: 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22: 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23: 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24: 2014-12-22 Ottawa Senators 1 Washington Capitals 2
Date Visitor VG Home HG Type
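To see what fill=TRUE does in isolation, here is a small offline sketch. The two data frames are hypothetical stand-ins for scraped pages: game_day mimics a normal box-score table, and no_games mimics a page with a different structure (its single X1 column is an assumption, not what the site actually returns).

```r
library(data.table)

# Hypothetical stand-in for a page with games
game_day <- data.frame(Date = "2014-12-22", Visitor = "Ottawa Senators",
                       VG = 1L, Home = "Washington Capitals", HG = 2L,
                       Type = "", stringsAsFactors = FALSE)

# Hypothetical stand-in for a page without games (different columns)
no_games <- data.frame(X1 = "No games played", stringsAsFactors = FALSE)

# fill = TRUE pads the columns missing from each table with NA
combined <- rbindlist(list(game_day, no_games), fill = TRUE)

# Keep rows that have a Date, and only the first 6 columns
valid <- combined[!is.na(Date), 1:6, with = FALSE]
```

The row coming from no_games has NA in Date, which is why filtering on !is.na(Date) removes it.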
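If you are not familiar with data.table, you can use it just for rbindlist, then convert the data.table back to a data.frame and carry on with the usual data.frame operations. That said, you really should learn data.table, because it is very fast and efficient on large data.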
resdf <- as.data.frame(resdt)
with(resdf, resdf[!is.na(Date), 1:6])
Date Visitor VG Home HG Type
1 2014-12-20 Colorado Avalanche 5 Buffalo Sabres 1
2 2014-12-20 New York Rangers 3 Carolina Hurricanes 2 SO
3 2014-12-20 Chicago Blackhawks 2 Columbus Blue Jackets 3 SO
4 2014-12-20 Arizona Coyotes 2 Los Angeles Kings 4
5 2014-12-20 Nashville Predators 6 Minnesota Wild 5 OT
6 2014-12-20 Ottawa Senators 1 Montreal Canadiens 4
7 2014-12-20 Washington Capitals 4 New Jersey Devils 0
8 2014-12-20 Tampa Bay Lightning 1 New York Islanders 3
9 2014-12-20 Florida Panthers 1 Pittsburgh Penguins 3
10 2014-12-20 St. Louis Blues 2 San Jose Sharks 3 OT
11 2014-12-20 Philadelphia Flyers 7 Toronto Maple Leafs 4
12 2014-12-20 Calgary Flames 2 Vancouver Canucks 3 OT
13 2014-12-21 Buffalo Sabres 3 Boston Bruins 4 OT
14 2014-12-21 Toronto Maple Leafs 0 Chicago Blackhawks 4
15 2014-12-21 Colorado Avalanche 2 Detroit Red Wings 1 SO
16 2014-12-21 Dallas Stars 6 Edmonton Oilers 5 SO
17 2014-12-21 Carolina Hurricanes 0 New York Rangers 1
18 2014-12-21 Philadelphia Flyers 4 Winnipeg Jets 3 OT
19 2014-12-22 San Jose Sharks 2 Anaheim Ducks 3 OT
20 2014-12-22 Nashville Predators 5 Columbus Blue Jackets 1
21 2014-12-22 Pittsburgh Penguins 3 Florida Panthers 4 SO
22 2014-12-22 Calgary Flames 4 Los Angeles Kings 3 OT
23 2014-12-22 Arizona Coyotes 1 Vancouver Canucks 7
24 2014-12-22 Ottawa Senators 1 Washington Capitals 2