Extracting an HTML table into R

I have been trying to extract a table from a web page. The data are flight track data from a live flight-tracking website (https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog).

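I have tried the XML, RCurl, and Curl packages, but none of them worked. Most likely that is because I cannot figure out how to get around SSL, or how to handle the columns that contain notes on the flight status (i.e. the first two at the top of the table and the third from the bottom).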

Does anyone know how to extract this table into R?

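As @hrbrmstr pointed out in the comments above, this violates FlightAware's terms of service, but what you do with your code is your business. :) Using the rvest package, this should get you most of the way there: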

library(rvest)

u <- "https://flightaware.com/live/flight/WJA1508/history/20150814/1720Z/CYYC/KSFO/tracklog"

##  read_html() replaces html(), which later rvest releases deprecated:
html_read <- read_html(u)
tbl <- html_table(
  html_nodes(html_read, "table"), 
  fill=TRUE, 
  header=FALSE, 
  trim=TRUE 
)[[2]]

##  Keep rows from the first row of data onward and remove all
##    columns that are entirely NA:
tbl_o <- tbl[6:nrow(tbl), ]
tbl_o <- tbl_o[,colSums(is.na(tbl_o))!=nrow(tbl_o)]

names(tbl_o) <- c(
  "Time", "Lat", "Lon", 
  "Course", "Direction", 
  "KTS", "MPH", "Alt", 
  "Rate", "Location"
)

str(tbl_o)

This produces:

'data.frame':   292 obs. of  10 variables:
 $ Time     : chr  "Fri 01:41:34 PM" "Fri 01:48:59 PM" "Fri 01:49:14 PM" "Fri 01:50:05 PM" ...
 $ Lat      : chr  "51.0833" "51.1551" "51.1683" "51.2235" ...
 $ Lon      : chr  "-113.9667" "-114.0209" "-114.0209" "-114.0220" ...
 $ Course   : chr  "335°" "0°" "0°" "358°" ...
 $ Direction: chr  "Northwest" "North" "North" "North" ...
 $ KTS      : chr  "20" "201" "219" "149" ...
 $ MPH      : chr  "23" "231" "252" "171" ...
 $ Alt      : chr  "3,500" "4,900" "5,200" "6,800" ...
 $ Rate     : chr  "" "222" "1,727" "1,701" ...
 $ Location : chr  "Edmonton Center" "FlightAware ADS-B  (CYYC)" "FlightAware ADS-B  (CYYC)" "FlightAware ADS-B  (CEG2)" ...
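
Every column comes back as character, so a natural next step is converting the numeric ones. The following is only a sketch in base R, run on a toy data frame that copies a few values from the output above rather than on the live page. The names `tbl_demo` and `to_number` are hypothetical and not part of rvest. It assumes the only non-numeric clutter is thousands separators and the degree sign, and that an empty string means a missing value.

```r
## Toy data frame mirroring a few rows of the str() output above.
## "\u00b0" is the degree sign, written as an escape for portability.
tbl_demo <- data.frame(
  Lat    = c("51.0833", "51.1551", "51.1683"),
  Lon    = c("-113.9667", "-114.0209", "-114.0209"),
  Course = c("335\u00b0", "0\u00b0", "0\u00b0"),
  KTS    = c("20", "201", "219"),
  Alt    = c("3,500", "4,900", "5,200"),
  Rate   = c("", "222", "1,727"),
  stringsAsFactors = FALSE
)

## Hypothetical helper: drop everything except digits, the decimal
## point and the minus sign, treat empty strings as NA, then convert.
to_number <- function(x) {
  x <- gsub("[^0-9.-]", "", x)
  x[x == ""] <- NA
  as.numeric(x)
}

num_cols <- c("Lat", "Lon", "Course", "KTS", "Alt", "Rate")
tbl_demo[num_cols] <- lapply(tbl_demo[num_cols], to_number)

str(tbl_demo)
```

On the real table, the same helper could be applied to the matching columns of tbl_o. The Time column is left alone here because it carries no date, so parsing it would need the flight date and a time zone from elsewhere.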