html in rvest versus htmlParse in XML
As the code below shows, html in the rvest package uses htmlParse from the XML package.
html
function (x, ..., encoding = NULL)
{
parse(x, XML::htmlParse, ..., encoding = encoding)
}
<environment: namespace:rvest>
htmlParse
function (file, ignoreBlanks = TRUE, handlers = NULL, replaceEntities = FALSE,
asText = FALSE, trim = TRUE, validate = FALSE, getDTD = TRUE,
isURL = FALSE, asTree = FALSE, addAttributeNamespaces = FALSE,
useInternalNodes = TRUE, isSchema = FALSE, fullNamespaceInfo = FALSE,
encoding = character(), useDotNames = length(grep("^\\.",
names(handlers))) > 0, xinclude = TRUE, addFinalizer = TRUE,
error = htmlErrorHandler, isHTML = TRUE, options = integer(),
parentFirst = FALSE)
.....
So, for the following URL:
myurl<-"http://www.nepalstock.com.np/"
parse_XML<-htmlParse(myurl) #runs without error
parse_rvest<-html(myurl) # fails with a 500 Internal Server Error
Error in parse.response(r, parser, encoding = encoding) :
server error: (500) Internal Server Error
Any ideas?
Reset the default user agent on the underlying httr::GET request, and then it works:
library(httr)
library(rvest)
parse_rvest <- html(myurl, add_headers("User-Agent" = "myagent"))
or
parse_rvest <- html(myurl, user_agent("myagent"))
Note that, for debugging purposes, you can add verbose() to the html(...) call.
Addendum:
With the newer rvest / xml2 / curl combination, it should look like:
library(xml2)
library(rvest)
library(curl)
parse_rvest <- curl(myurl, handle = new_handle("useragent" = "myua")) %>%
read_html()
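For anyone who would rather stay with httr, the same idea can be sketched as below. This is only a minimal sketch, not part of the original answer: fetch_page is a hypothetical helper name, and the "myagent" string is the same placeholder used above. It sends the request with an explicit user agent, stops on an HTTP error status, and hands the response body to xml2::read_html.

```r
library(httr)
library(xml2)

# Hypothetical helper (name is an assumption, not from rvest or httr):
# fetch a page with an explicit User-Agent and parse it as HTML.
fetch_page <- function(url, agent = "myagent") {
  # Override httr's default user agent for this request only
  resp <- GET(url, user_agent(agent))
  # Turn 4xx/5xx responses (such as the 500 above) into an R error
  stop_for_status(resp)
  # Parse the body text into an xml2 document
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (requires network access):
# parse_httr <- fetch_page(myurl)
```

Keeping the request and the parsing as separate steps also makes it easy to inspect the response, for example its status code, before parsing.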