Parsing versus Scraping in R

I am new to R and have a question about the most efficient way to build a database. I want to build a database of NFL statistics. These statistics are readily available in many places on the web, but the most thorough source I have found is Pro-Football-Reference (http://www.pro-football-reference.com/). This will be panel data where the time intervals are each week of each season, my observations are each player in each game, and my columns are the statistics tallied in all of the tables of Pro-Football-Reference's boxscores (http://www.pro-football-reference.com/boxscores/201702050atl.htm).

I can scrape each table for every game of every season, for example:

#PACKAGES
library(rvest)
library(XML)
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
scoring.201702050atl = readHTMLTable("http://www.pro-football-reference.com/boxscores/201702050atl.htm", which = 2)
game.info.201702050atl = comments.201702050atl[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table()
officials.201702050atl = comments.201702050atl[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table()
team.stats.201702050atl = comments.201702050atl[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table()
scorebox.201702050atl = readHTMLTable("http://www.pro-football-reference.com/boxscores/201702050atl.htm", which = 1)
expected.points.201702050atl = comments.201702050atl[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table()
player.offense.201702050atl = comments.201702050atl[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table()
player.defense.201702050atl = comments.201702050atl[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table()
returns.201702050atl = comments.201702050atl[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table()
kicking.201702050atl = comments.201702050atl[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table()
home.starters.201702050atl = comments.201702050atl[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table()
vis.starters.201702050atl = comments.201702050atl[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table()
home.snap.counts.201702050atl = comments.201702050atl[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table()
vis.snap.counts.201702050atl = comments.201702050atl[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table()
targets.directions.201702050atl = comments.201702050atl[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table()
rush.directions.201702050atl = comments.201702050atl[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table()
pass.tackles.201702050atl = comments.201702050atl[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table()
rush.tackles.201702050atl = comments.201702050atl[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table()
home.drives.201702050atl = comments.201702050atl[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table()
vis.drives.201702050atl = comments.201702050atl[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table()
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()

However, the number of lines of code required to clean every scraped table for 256 games per season suggests there must be a more efficient way.

The NFL officially records statistics in their gamebooks (http://www.nfl.com/liveupdate/gamecenter/57167/ATL_Gamebook.pdf). Since sites like Pro-Football-Reference include statistics not recorded in the official gamebooks, and since the identifying language needed to derive them is contained in the gamebooks' play-by-play, I infer that they run a function that parses the play-by-play and tallies their statistics. Although I am a novice who has never written a function or parsed anything in R before, I imagine I could apply one function to every gamebook, which would be a more efficient approach than scraping each individual table. Am I on the right track? I don't want to invest a lot of effort in the wrong direction.
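
For what it's worth, that pattern is easy to express in R: write one parsing function and apply it across a vector of gamebook URLs with lapply(). A minimal sketch, where parse_gamebook is a hypothetical placeholder still to be written:

# HYPOTHETICAL SKETCH: ONE PARSING FUNCTION APPLIED OVER EVERY GAMEBOOK
parse_gamebook <- function(pdf.url) {
    # download the gamebook, convert it to text, parse the play-by-play,
    # and return a data frame of statistics (see the sketches below)
}

gamebook.urls <- c("http://www.nfl.com/liveupdate/gamecenter/57167/ATL_Gamebook.pdf")
season.stats <- lapply(gamebook.urls, parse_gamebook)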

Another problem arises because the gamebooks are PDFs. Play-by-plays exist in table form on other sites, but none of them are complete. I have read some excellent tutorials on this site about converting a PDF to text with

library(tm)

but I have not yet figured out how to use it.
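
For reference, a minimal sketch of PDF-to-text conversion with tm's readPDF(), assuming a recent tm where the pdftools engine is available (it requires the pdftools package) and that the gamebook is downloaded locally first:

library(tm)

# DOWNLOAD THE GAMEBOOK PDF LOCALLY
download.file("http://www.nfl.com/liveupdate/gamecenter/57167/ATL_Gamebook.pdf",
              "ATL_Gamebook.pdf", mode = "wb")

# readPDF() RETURNS A READER FUNCTION; CALL IT ON THE LOCAL FILE
reader <- readPDF(engine = "pdftools")
gamebook <- reader(elem = list(uri = "ATL_Gamebook.pdf"),
                   language = "en", id = "ATL_Gamebook")
gamebook.text <- content(gamebook)   # character vector of the extracted text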

Once I have converted the entire PDF to text, do I simply identify the play-by-play section, parse it out, and then parse each statistic out of that? Are there other obstacles my limited experience keeps me from foreseeing?
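
In outline, yes. Continuing the sketch above, and assuming (this is a guess to verify against a real gamebook) that each play line begins with a down-distance-spot token such as "1-10-ATL 25", the play-by-play lines could be pulled out with a regular expression:

# SPLIT THE EXTRACTED TEXT INTO INDIVIDUAL LINES
gb.lines <- unlist(strsplit(gamebook.text, "\n"))

# KEEP LINES THAT LOOK LIKE PLAYS: DOWN-DISTANCE-SPOT PREFIX (ASSUMED FORMAT)
play.lines <- grep("^\\s*[1-4]-[0-9]{1,2}-[A-Z]{2,3}\\s*[0-9]{1,2}", gb.lines, value = TRUE)

# EXAMPLE DOWNSTREAM TALLY: PASS ATTEMPTS CREDITED TO ONE (HYPOTHETICAL) PLAYER STRING
sum(grepl("M\\.Ryan pass", play.lines))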

This question may be too "beginner" for this site; however, could someone set me straight here, or point me to a resource that can? Thank you very much for the help.

Consider generalizing your one-game parsing to all games by storing the html tables in a growing list for all 256 games. Below is an example for week 1.

doc <- htmlParse(readLines("http://www.pro-football-reference.com/years/2016/week_1.htm"))

# EXTRACT ALL GAME PAGES
games <- xpathSApply(doc, "//a[text()='Final']/@href")

# FUNCTION TO BUILD HTML TABLE LIST
getwebdata <- function(url) {
    print(url)
    boxscoreurl <- paste0("http://www.pro-football-reference.com", url)
    page <- read_html(boxscoreurl)
    comments <- page %>% html_nodes(xpath = "//comment()")

    list(
      scoring = readHTMLTable(boxscoreurl, which = 2),
      game.info = comments[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table(),
      officials = comments[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table(),
      team.stats = comments[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table(),
      scorebox = readHTMLTable(boxscoreurl, which = 1),
      expected.points = comments[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table(),
      player.offense = comments[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table(),
      player.defense = comments[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table(),
      returns = comments[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table(),
      kicking = comments[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table(),
      home.starters = comments[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table(),
      vis.starters = comments[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table(),
      home.snap.counts = comments[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table(),
      vis.snap.counts = comments[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table(),
      targets.directions = comments[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table(),
      rush.directions = comments[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table(),
      pass.tackles = comments[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table(),
      rush.tackles = comments[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table(),
      home.drives = comments[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table(),
      vis.drives = comments[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table(),
      pbp = comments[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
    )
}

# ALL WEEK ONE LIST OF HTML TABLE(S) DATA
week1datalist <- lapply(games, getwebdata)

# TRY/CATCH VERSION (ANY PARSING ERROR RETURNS AN EMPTY LIST)
week1datalist <- lapply(games, function(g) {
    tryCatch(getwebdata(g), error = function(e) list())
})

# NAME EACH LIST ELEMENT BY CORRESPONDING GAME
shortgames <- gsub("/", "", gsub(".htm", "", games))
week1datalist <- setNames(week1datalist, shortgames)

Finally, you can reference a particular game's specific stats table by name:

week1datalist$boxscores201609080den$scoring

week1datalist$boxscores201609110atl$game.info

Also, you may want to include tryCatch inside the lapply, as above, since some pages may be inconsistent.
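
If the week-1 run works, the same approach could in principle be extended over all 17 regular-season weeks to reach the full 256 games. An untested sketch, reusing getwebdata() from above:

# UNTESTED SKETCH: REPEAT THE WEEK-1 APPROACH FOR ALL 17 WEEKS
allweeksdatalist <- lapply(1:17, function(wk) {
    weekurl <- paste0("http://www.pro-football-reference.com/years/2016/week_", wk, ".htm")
    doc <- htmlParse(readLines(weekurl))
    games <- xpathSApply(doc, "//a[text()='Final']/@href")
    weekdata <- lapply(games, function(g) tryCatch(getwebdata(g), error = function(e) list()))
    setNames(weekdata, gsub("/", "", gsub(".htm", "", games)))
})
names(allweeksdatalist) <- paste0("week", 1:17)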