Scraping website with R: XML content does not seem to be XML
I am trying to scrape a table from a website (https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks), but none of the approaches I have tried work. When I run the code below, I get the following error: XML content does not seem to be XML
library("XML")
library("RCurl")
readHTMLTable("https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks")
None of the following approaches using RCurl worked either:
rts.url <- getURL("https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks")
xmlParse(rts.url)
xmlInternalTreeParse(rts.url)
readHTMLTable(rts.url)
httr was also unsuccessful:
library("httr")
GET("https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks")
rvest was also unsuccessful:
library("rvest")
read_html("https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks")
I am not very familiar with RSelenium, but here is my attempt based on the examples in the documentation:
library("RSelenium")
startServer()
remDr <- remoteDriver$new()
remDr$open()
remDr$navigate("https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks")
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.UnsupportedCommandException
For tricky tables like this one, I often find that grabbing the XPath from Firebug or the browser's developer tools is the most useful approach.
library("RSelenium")
startServer()
remDr <- remoteDriver$new()
remDr$open()
remDr$navigate("https://www.freedraftguide.com/fantasy-football/rankings/quarterbacks")
player_table <- remDr$findElement('xpath', '/html/body/div[2]/div[3]/table/tbody')
# Pull the table body out as a single string once, rather than
# calling getElementText() a second time for the split below
table_text <- player_table$getElementText()[[1]]
print(table_text)
# Each table row is on its own line
players <- strsplit(table_text, "\n")
final <- c()
for (x in players[[1]]) {
  temp <- unlist(strsplit(x, " "))  # split each row into fields on spaces
  final <- rbind(final, temp)
}
final <- data.frame(final, row.names = NULL)
R> print(head(final))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
1 1. Aaron Rodgers GNB 4 39 4391 0 2 311 54 27
2 2. Cam Newton CAR 7 33 3982 0 9 651 130 27
3 3. Andrew Luck IND 10 36 4769 0 2 283 60 26
4 4. Drew Brees NOR 5 33 4925 0 1 41 26 22
5 5. Ben Roethlisberger PIT 8 35 4916 0 0 43 31 20
6 6. Russell Wilson SEA 5 34 4063 0 4 592 109 20
I realize the for loop is not ideal, but sometimes it can be the best option for scraping a page like this.
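If the loop bothers you, the row-by-row `rbind` can be replaced with a single vectorized split. A minimal sketch, using a hypothetical `raw_text` stand-in for the `getElementText()` result so it runs without a Selenium session:

```r
# raw_text is an assumed sample of what getElementText() returns:
# one table row per line, fields separated by single spaces.
raw_text <- paste(
  "1. Aaron Rodgers GNB 4 39 4391 0 2 311 54 27",
  "2. Cam Newton CAR 7 33 3982 0 9 651 130 27",
  sep = "\n"
)

# Split into rows, then split every row on spaces in one pass,
# and bind the resulting list of character vectors into a matrix.
rows <- strsplit(raw_text, "\n")[[1]]
final <- data.frame(do.call(rbind, strsplit(rows, " ")),
                    stringsAsFactors = FALSE)
print(final)
```

This only works cleanly when every row splits into the same number of fields; ragged rows would make `do.call(rbind, ...)` recycle values, so the loop version is safer if the table is irregular.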