Scraping embedded HTML table in R

Question

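I'm new to scraping/parsing HTML in R. I'm trying to get the data from the Career Receiving Statistics and Career Rushing Statistics tables at http://totalfootballstats.com/PlayerWR.asp?id=1218565. I know about the readHTMLTable function, but both tables are embedded in so much junk that I can't seem to get past the children of the root node.

Edit: The problem above has been solved. But for the site http://www.sports-reference.com/cfb/players/a-index.html, I'm trying to loop through all the players and access their data, and I'm having trouble getting at their individual URL links. I tried: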

fb=htmlParse("http://www.sports-reference.com/cfb/players/a-index.html")
p1=getNodeSet(fb,'//pre')
con = textConnection(xmlValue(p1[[100]]))
players100 = read.table(con)

But this results in the error "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 3 did not have 5 elements".

Another thing I tried is:

 links <- xpathSApply(fb, "//a/@href")

But I feel like there should be a better way to do this?

Answer 1

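Well, here's the same player from a different website, and it's much cleaner. The data doesn't match, though, so somebody has it wrong. My money is on totalfootballstats.com. Choose your sources wisely!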

library(XML)  # provides readHTMLTable()
readHTMLTable(
    "http://www.sports-reference.com/cfb/players/doyle-aaron-1.html"
)
# $receiving
#  Year     School Conf Class Pos  G Rec Yds  Avg TD Att Yds  Avg TD Plays Yds  Avg TD
# 1 1988 Miami (FL)  Ind        WR 11   1  12 12.0  0   1  34 34.0  0     2  46 23.0  0
# 2 1989 Miami (FL)  Ind        WR 11   8  93 11.6  1                     8  93 11.6  1

# $kick_ret
#   Year     School Conf Class Pos  G Ret Yds Avg TD Ret Yds Avg TD
# 1 1988 Miami (FL)  Ind        WR 11   1   8 8.0  0               
# 2 1989 Miami (FL)  Ind        WR 11
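
If only one table is wanted, or if the tables sit deep inside other markup as on the original page, one option is to pick out the table node with an XPath query first and then hand just that node to readHTMLTable(). Here is a minimal sketch on a made-up HTML fragment. The markup and the id value "receiving" are assumptions modelled on the list names in the output above, not the site's actual source.

```r
library(XML)

# Hypothetical page fragment: the table of interest is nested inside layout markup.
html <- '
<html><body>
  <div class="wrapper"><div class="ad">junk</div>
    <table id="receiving">
      <tr><th>Year</th><th>School</th><th>Rec</th><th>Yds</th></tr>
      <tr><td>1988</td><td>Miami (FL)</td><td>1</td><td>12</td></tr>
      <tr><td>1989</td><td>Miami (FL)</td><td>8</td><td>93</td></tr>
    </table>
  </div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# The leading "//" matches the table at any depth, so the surrounding
# markup does not have to be walked child by child from the root.
node <- getNodeSet(doc, "//table[@id='receiving']")[[1]]

# readHTMLTable() also accepts a single table node, not just a URL or document.
receiving <- readHTMLTable(node, stringsAsFactors = FALSE)
print(receiving)
```

On a real page the same query should work on the result of htmlParse(url), provided the table really carries that id.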

For a specific request, it looks like you can construct a valid URL like this, which will also build the paths for multiple players at once.

## base URI 
u <- "http://www.sports-reference.com"
## player first and last names
first <- "bill"
last <- "adams"
## use sprintf() to make all the paths at once
fullPath <- sprintf("%s/cfb/players/%s-%s-1.html", u, first, last)
## read the table - I think you'll need to loop readHTMLTable() though
readHTMLTable(fullPath)
# $receiving
#  Year School Conf Class Pos  G Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
# 1 1969 Dayton  Ind        WR 10   1   3  3.0  1                    1   3  3.0  1
# 2 1970 Dayton  Ind        WR 10   4  42 10.5  1                    4  42 10.5  1
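
As the comment in the code hints, readHTMLTable() reads one page per call, so several paths need a loop. The sketch below is one way to do that, and it also narrows the `//a/@href` query from the question to player pages only. The index fragment, the helper name readPlayer, and the guess that player links contain "/cfb/players/" and end in ".html" are all assumptions based on the URLs shown above, not taken from the site's markup.

```r
library(XML)

u <- "http://www.sports-reference.com"

# Hypothetical stand-in for the index page, so the example runs offline.
index <- '
<html><body>
  <a href="/cfb/">Home</a>
  <a href="/cfb/players/b-index.html">B</a>
  <pre>
    <a href="/cfb/players/doyle-aaron-1.html">Doyle Aaron</a> 1988-1989
    <a href="/cfb/players/bill-adams-1.html">Bill Adams</a> 1969-1970
  </pre>
</body></html>'

doc <- htmlParse(index, asText = TRUE)

# Keep only hrefs that look like individual player pages
# (assumed pattern; letter index pages are filtered out).
xp <- paste0(
    "//a[contains(@href, '/cfb/players/')",
    " and contains(@href, '.html')",
    " and not(contains(@href, '-index'))]/@href"
)
hrefs <- xpathSApply(doc, xp)
playerUrls <- paste0(u, as.character(hrefs))

# Hypothetical helper: read every table on one page, returning NULL
# instead of stopping the whole loop if a page fails to load or parse.
readPlayer <- function(url) {
    tryCatch(
        readHTMLTable(url, stringsAsFactors = FALSE),
        error = function(e) NULL
    )
}

# Uncomment to fetch the pages (requires network access):
# stats <- lapply(playerUrls, readPlayer)
# names(stats) <- basename(playerUrls)
print(playerUrls)
```

Taking the hrefs from the index page avoids guessing the numeric suffix in each file name, which the sprintf() approach hard-codes as 1.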
