使用 readHTMLTable 从 URL 抓取数据后,如何将结果转换为数据框?

After scraping data from a URL using readHTMLTable, how can I convert the result to a data frame?

我尝试了各种不同的操作,但我的基本问题是:

url<- "http://www.ref.org.uk/fuel/tablebysp.php?valdate=2015-03-08"
data <- readHTMLTable(url,header = TRUE,as.data.frame =TRUE,which=2)
typeof(data)

我的数据看起来不错,但我无法将其强制转换为数据框。我不知道是什么阻止了我。

正如 post 评论中指出的那样,您的代码实际上工作正常。您可以获得字符串而不是因子:

url<- "http://www.ref.org.uk/fuel/tablebysp.php?valdate=2015-03-08"
data <- readHTMLTable(url, header=TRUE, as.data.frame=TRUE, which=2, 
                      stringsAsFactors=FALSE)

下面是如何使用 rvest 包来做到这一点,它真的很适合这个,特别是当有多个 table 或奇怪的嵌套时。而且,还有一些 dplyr

在这种情况下,不止一个table,第二个就是你想要的。值得庆幸的是,它的格式非常好。下面的代码从页面中提取所有 tables(使用 CSS 选择器)然后使用方便的 magrittr extract2 来避免 weird/ugly [[]] 用法。

管道习语(以 magrittr 开头,现在在 hadleyverse 的大部分内容中使用)"pushes" 或 "flows" 从左到右的数据与 "pops" 数据从嵌套括号调用中退出。

library(rvest)
library(magrittr)
library(dplyr)

pg <- html("http://www.ref.org.uk/fuel/tablebysp.php?valdate=2015-03-08")
dat <- pg %>% html_nodes("table") %>% extract2(2) %>% html_table(header=TRUE)
glimpse(dat)

## Observations: 48
## Variables:
## $ SD         (chr) "2015-03-08", "2015-03-08", "2015-03-08", "2015-03-08", "2015-03-08", "2015-03-08", "...
## $ SP         (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24...
## $ Gas        (chr) "3,467", "3,522", "3,594", "3,529", "2,811", "2,538", "2,520", "2,489", "2,498", "2,5...
## $ Coal       (chr) "8,261", "8,062", "7,876", "7,437", "6,751", "6,799", "6,621", "6,428", "6,586", "6,2...
## $ Nuclear    (chr) "7,495", "7,553", "7,641", "7,676", "7,674", "7,672", "7,676", "7,677", "7,672", "7,6...
## $ Hydro      (int) 737, 729, 666, 651, 646, 647, 645, 648, 658, 729, 734, 736, 740, 738, 740, 741, 751, ...
## $ Net Pumped (chr) "-438", "-84", "-504", "-860", "-1,092", "-1,118", "-1,396", "-1,700", "-1,606", "-1,...
## $ Wind       (chr) "4,675", "4,795", "4,623", "4,572", "4,647", "4,570", "4,377", "4,445", "4,602", "4,5...
## $ OCGT       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Oil        (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Biomass    (chr) "1,078", "1,079", "1,081", "1,048", "1,005", "1,022", "1,086", "1,035", "1,072", "1,0...
## $ French Int (chr) "480", "480", "680", "678", "1,614", "1,626", "1,532", "1,536", "772", "772", "504", ...
## $ Dutch Int  (chr) "860", "852", "874", "838", "850", "866", "848", "830", "830", "866", "862", "862", "...
## $ NI Int     (int) 22, -72, 2, -16, -30, -50, 4, 4, -122, -138, -108, -114, -2, 16, 24, 24, 28, -30, -42...
## $ Eire Int   (int) 170, 190, 142, 142, 142, 114, 114, 114, 114, 112, 88, 50, 16, 16, 16, 42, 18, -72, -1...
## $ Net Supply (chr) "26,807", "27,106", "26,675", "25,695", "25,018", "24,686", "24,027", "23,506", "23,0...

您也可以这样做:

html_table(extract2(html_nodes(pg, "table"), 2), header=TRUE)

如果您不喜欢或通常不使用管道。

然后您可以对列进行一些基本清理以获得有用的 numeric/date 值:

dat %>% 
  mutate(SD=as.Date(SD),
         Gas=as.numeric(gsub(",", "", Gas)),
         Coal=as.numeric(gsub(",", "", Coal)),
         Nuclear=as.numeric(gsub(",", "", Nuclear)),
         `Net Pumped`=as.numeric(gsub(",", "", `Net Pumped`)),
         `Wind`=as.numeric(gsub(",", "", `Wind`)),
         Biomass=as.numeric(gsub(",", "", Biomass)),
         `French Int`=as.numeric(gsub(",", "", `French Int`)),
         `Dutch Int`=as.numeric(gsub(",", "", `Dutch Int`)),
         `Net Supply`=as.numeric(gsub(",", "", `Net Supply`))) -> dat

glimpse(dat)

## Observations: 48
## Variables:
## $ SD         (date) 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, 2015-03-08, ...
## $ SP         (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24...
## $ Gas        (dbl) 3467, 3522, 3594, 3529, 2811, 2538, 2520, 2489, 2498, 2543, 2531, 2522, 2627, 2729, 2...
## $ Coal       (dbl) 8261, 8062, 7876, 7437, 6751, 6799, 6621, 6428, 6586, 6229, 6194, 6299, 6455, 6639, 6...
## $ Nuclear    (dbl) 7495, 7553, 7641, 7676, 7674, 7672, 7676, 7677, 7672, 7670, 7673, 7677, 7677, 7681, 7...
## $ Hydro      (int) 737, 729, 666, 651, 646, 647, 645, 648, 658, 729, 734, 736, 740, 738, 740, 741, 751, ...
## $ Net Pumped (dbl) -438, -84, -504, -860, -1092, -1118, -1396, -1700, -1606, -1632, -1344, -1052, -1342,...
## $ Wind       (dbl) 4675, 4795, 4623, 4572, 4647, 4570, 4377, 4445, 4602, 4570, 4529, 4512, 4312, 3976, 3...
## $ OCGT       (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Oil        (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Biomass    (dbl) 1078, 1079, 1081, 1048, 1005, 1022, 1086, 1035, 1072, 1086, 1084, 1085, 1086, 1082, 1...
## $ French Int (dbl) 480, 480, 680, 678, 1614, 1626, 1532, 1536, 772, 772, 504, 502, 1598, 1602, 1878, 188...
## $ Dutch Int  (dbl) 860, 852, 874, 838, 850, 866, 848, 830, 830, 866, 862, 862, 884, 846, 942, 914, 1032,...
## $ NI Int     (int) 22, -72, 2, -16, -30, -50, 4, 4, -122, -138, -108, -114, -2, 16, 24, 24, 28, -30, -42...
## $ Eire Int   (int) 170, 190, 142, 142, 142, 114, 114, 114, 114, 112, 88, 50, 16, 16, 16, 42, 18, -72, -1...
## $ Net Supply (dbl) 26807, 27106, 26675, 25695, 25018, 24686, 24027, 23506, 23076, 22807, 22747, 23079, 2...