Use readHTMLTable over a List of Dates and Create New Date Column with Data
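I am trying to write a loop that runs readHTMLTable() over a sequence of consecutive dates that I generate. I have managed to import all of the data for those dates. However, the data has no date column, so, using the date sequence I feed to the loop, I would like the loop to read the HTML table and then add a new column containing the date used for that iteration.
Here is what I have so far: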
library(XML)
library(RCurl)
library(plyr)
# create the days
x <- seq(as.Date("2015-04-10"), as.Date("2015-04-15"), by = "day")
# create a url template for sprintf()
utmp <- "http://www.basketball-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d"
# convert to numeric matrix after splitting for year, month, day
m <- do.call(rbind, lapply(strsplit(as.character(x), "-"), type.convert))
# create the list to hold the results
tables <- vector("list", nrow(m))
# get the tables
for(i in seq_len(nrow(m))) {
    # create the url for the day and if it exists, read it - if not, NULL
    tables[[i]] <- if(url.exists(u <- sprintf(utmp, m[i, 2], m[i, 3], m[i, 1])))
        readHTMLTable(u, stringsAsFactors = FALSE)
    else NULL
}
data <- ldply(tables,data.frame)
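So basically, I would like my final data frame to include the dates from m as a new column named data$Date.
Thanks for any help, and let me know if you need any clarification!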
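Consider using mapply() (the multivariate member of the apply family) to pass the list of dates, the URLs, and a table iterator to download the HTML tables. You can avoid the matrix processing, since format() can extract the parts of a Date. Also, consider not storing NULL for non-existent URLs, since NULL entries may not bind afterwards. Simply filter out the empty elements.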
# LIST OF DATES
x <- lapply(0:5, function(i) as.Date("2015-04-10") + i)
# LIST OF URLS (%Y IS THE FOUR-DIGIT YEAR; %y WOULD GIVE 15 INSTEAD OF 2015)
utmp <- "http://www.basketball-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d"
urlist <- c(lapply(x, function(i) sprintf(utmp, as.numeric(format(i, '%m')),
                                          as.numeric(format(i, '%d')),
                                          as.numeric(format(i, '%Y')))))
# USER DEFINED FUNCTION
tables <- vector("list", length(x))
tabledwnld <- function(dt, url, i) {
    if (url.exists(url)) {
        tableNodes <- readHTMLTable(url, stringsAsFactors = FALSE)
        # tables is modified as a local copy only, so return just this date's table
        tables[[i]] <- tableNodes[[1]]
        tables[[i]]$Date <- dt
        return(tables[[i]])
    }
}
# APPLY ABOVE FUNCTION (SIMPLIFY = FALSE RETURNS A LIST: ONE DATA FRAME OR NULL PER DATE)
data <- mapply(tabledwnld, x, urlist, seq_along(x), SIMPLIFY = FALSE)
# FILTER OUT EMPTY ELEMENTS, THEN BIND TO DATA FRAME
finaldata <- do.call(rbind, Filter(Negate(is.null), data))
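The date handling and the filter-then-bind step can be checked without touching the network. The following is a minimal sketch, assuming a hypothetical stand-in fake_fetch() in place of readHTMLTable() and made-up rows; build_url() and the other names are illustrative, not part of the XML or RCurl packages.

```r
# Dates as a plain Date vector
dates <- seq(as.Date("2015-04-10"), as.Date("2015-04-12"), by = "day")

utmp <- "http://www.basketball-reference.com/friv/dailyleaders.cgi?month=%d&day=%d&year=%d"

# Build one URL per date; format() pulls out month, day and four-digit year
build_url <- function(d) {
  sprintf(utmp,
          as.integer(format(d, "%m")),
          as.integer(format(d, "%d")),
          as.integer(format(d, "%Y")))
}
urls <- build_url(dates)

# Hypothetical stand-in for readHTMLTable(): returns NULL for one date to
# mimic a day with no page, and a tiny made-up table otherwise
fake_fetch <- function(url) {
  if (grepl("day=11", url, fixed = TRUE)) return(NULL)
  data.frame(Player = c("A", "B"), PTS = c(30L, 25L), stringsAsFactors = FALSE)
}

# Fetch each table and tag it with the date used for that iteration
tagged <- Map(function(d, u) {
  tbl <- fake_fetch(u)
  if (is.null(tbl)) return(NULL)
  tbl$Date <- d
  tbl
}, as.list(dates), urls)

# Drop the NULL entries, then bind the rest into one data frame
final <- do.call(rbind, Filter(Negate(is.null), tagged))
print(final)
```

Swapping fake_fetch() for a function that calls readHTMLTable() and returns its first table should leave the rest of the sketch unchanged.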
Also, please heed the warning that @hrbrmstr raised in the comments, quoted below. You may want to space out your table downloads:
Except as specifically provided in this paragraph, you agree not to
use or launch any automated system, including without limitation,
robots, spiders, offline readers, or like devices, that accesses the
Site in a manner which sends more request messages to the Site server
in any given period of time than a typical human would normally
produce in the same period by using a conventional on-line Web browser
to read, view, and submit materials.
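One way to space out the downloads is to pause between requests. This is a minimal sketch, assuming a hypothetical helper fetch_politely() and an arbitrary delay; Sys.sleep() is base R, but the terms quoted above do not give a number, so the pause length here is only a guess.

```r
# Hypothetical helper: call `fetch` on each URL, sleeping `delay` seconds
# between consecutive requests (no sleep before the first one)
fetch_politely <- function(urls, fetch, delay = 5) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)
    # results[i] <- list(...) keeps a NULL result in place;
    # results[[i]] <- NULL would delete the element instead
    results[i] <- list(fetch(urls[[i]]))
  }
  results
}

# Usage with a dummy fetch function and a short delay
out <- fetch_politely(c("u1", "u2", "u3"), function(u) toupper(u), delay = 0.1)
```

Passing the fetch function as an argument keeps the pacing logic separate from the scraping code, so the helper can be tried with a dummy function before it is pointed at the real site.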