Difference between read_html(url) and read_html(content(GET(url), "text"))

Question

I was looking at this great answer.

The solution starts with:

library(httr)
library(xml2)

gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(content(gr, "text"))

xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

The output stays the same across multiple requests:

"59243d3a2....61f8f73136118f9"

Until now, my default approach has been:

doc <- read_html("https://nzffdms.niwa.co.nz/search")
xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value")

This result differs from the output above, and it changes from one request to the next.

Question:

What is the difference between:

read_html(url)
read_html(content(GET(url), "text"))

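Why do the two lead to different values, and why does only the "GET" solution return the csv in the linked question?

(I hope it is acceptable to structure this as three sub-questions.)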

What I tried:

Going down the rabbit hole of function calls:

read_html
(ms <- methods("read_html"))
getAnywhere(ms[1])
xml2:::read_html
xml2:::read_html.default
#xml2:::read_html.response

read_xml
(ms <- methods("read_xml"))
getAnywhere(ms[1])

But that only led me to this question.

Thoughts:

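I don't see the GET request taking any headers or cookies that would explain the different responses.
As far as I understand, both read_html(url) and read_html(content(GET(.), "text")) return XML/HTML.
I am not sure whether this check makes sense, but since I was running out of ideas, I checked whether some kind of caching is going on.

Code: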

with_verbose(GET("https://nzffdms.niwa.co.nz/search"))
....
<- Expires: Thu, 19 Nov 1981 08:52:00 GMT
<- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0

--> It seems to me that caching is probably not the explanation.

Looking at help("GET") turned up an interesting section on "conditional GET":

The semantics of the GET method change to a "conditional GET" if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field. A conditional GET method requests that the entity be transferred only under the circumstances described by the conditional header field(s). The conditional GET method is intended to reduce unnecessary network usage by allowing cached entities to be refreshed without requiring multiple requests or transferring data already held by the client.

But as far as I can tell from the with_verbose() output, none of If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range is set.
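The same check can be sketched programmatically. This is only a rough sketch: find_conditional() and check_request() are hypothetical helper names, and it assumes that resp$request$headers holds the headers httr itself set on the request, so anything added at a lower level would not show up there.

```r
library(httr)

# Header fields that turn a plain GET into a conditional GET
conditional_fields <- c("If-Modified-Since", "If-Unmodified-Since",
                        "If-Match", "If-None-Match", "If-Range")

# Return the conditional header names found among `header_names`
# (compared case-insensitively).
find_conditional <- function(header_names) {
  header_names[tolower(header_names) %in% tolower(conditional_fields)]
}

# Offline checks with made-up header names
no_hits  <- find_conditional(c("Accept", "User-Agent"))
one_hit  <- find_conditional(c("Accept", "if-none-match"))

# Live variant (needs network access): inspect the headers recorded
# on the request object of an httr response.
check_request <- function(url) {
  resp <- GET(url)
  find_conditional(names(resp$request$headers))
}
```

Calling check_request("https://nzffdms.niwa.co.nz/search") returns the names of any conditional header fields httr recorded for the request; an empty result would be consistent with what with_verbose() showed.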

Answer 1

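The difference is that with repeated calls to httr::GET, the handle persists between calls. With xml2::read_html(), a new connection is made each time.

From the httr documentation: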

The handle pool is used to automatically reuse Curl handles for the same scheme/host/port combination. This ensures that the http session is automatically reused, and cookies are maintained across requests to a site without user intervention.

From the xml2 documentation, on the string argument passed to read_html():

A string can be either a path, a url or literal xml. Urls will be converted into connections either using base::url or, if installed, curl::curl
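As a small offline illustration of the "literal xml" case in that quote: the page fragment and the token value abc123 below are invented for this example, and only the input name comes from the question. A string that contains markup is parsed directly, which is what happens to the output of content(gr, "text"), so the parsing step is the same in both approaches and only the way the page is fetched differs.

```r
library(xml2)

# Hypothetical page fragment; the token value is made up for this example.
html_text <- paste0(
  "<html><body><form>",
  "<input type='hidden' name='search[_csrf_token]' value='abc123'/>",
  "</form></body></html>"
)

# A string containing markup is treated as literal HTML,
# not as a file path or a URL.
doc <- read_html(html_text)

# Same XPath as in the question, applied to the parsed document
token <- xml_attr(
  xml_find_first(doc, ".//input[@name='search[_csrf_token]']"),
  "value"
)

print(class(doc))
print(token)
```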

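So the answer is that read_html(GET(url)) is like refreshing your browser, whereas read_html(url) is like closing your browser and opening a new one. The server puts a unique session ID on the page it delivers. New session, new ID. You can prove this by calling httr::handle_reset(url):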

library(httr)
library(xml2)

# GET the page (note xml2 handles httr responses directly, don't need content("text"))
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

# A new GET using the same handle gets exactly the same response
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

# Now call GET again after resetting the handle
httr::handle_reset("https://nzffdms.niwa.co.nz/search")
gr <- GET("https://nzffdms.niwa.co.nz/search")
doc <- read_html(gr)
print(xml_attr(xml_find_all(doc, ".//input[@name='search[_csrf_token]']"), "value"))

In my case, sourcing the above code gives me:

[1] "ecd9be7c75559364a2a5568049c0313f"
[1] "ecd9be7c75559364a2a5568049c0313f"
[1] "d953ce7acc985adbf25eceb89841c713"
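The same idea can be sketched with explicit handles instead of the automatic handle pool. This is a minimal sketch, not a definitive recipe: token_from_doc(), token_via_handle() and compare_handles() are hypothetical helper names, and it assumes the site still behaves as described above. Each handle created with httr::handle() keeps its own cookies, so two handles act like two separate browser sessions.

```r
library(httr)
library(xml2)

search_url <- "https://nzffdms.niwa.co.nz/search"

# Pure helper: pull the CSRF token out of a parsed document.
token_from_doc <- function(doc) {
  xml_attr(
    xml_find_first(doc, ".//input[@name='search[_csrf_token]']"),
    "value"
  )
}

# Fetch the search page through a given handle and return its token.
token_via_handle <- function(h) {
  resp <- GET(search_url, handle = h)
  stop_for_status(resp)
  token_from_doc(read_html(resp))
}

# Live comparison (needs network access): two requests on one handle,
# then one request on a second, independent handle.
compare_handles <- function() {
  h1 <- handle("https://nzffdms.niwa.co.nz")
  h2 <- handle("https://nzffdms.niwa.co.nz")
  c(
    first_h1  = token_via_handle(h1),
    second_h1 = token_via_handle(h1),  # cookies from h1 are sent back
    first_h2  = token_via_handle(h2)   # separate cookie store
  )
}
```

Calling compare_handles() returns the three tokens as a named character vector. If the explanation above holds, the two h1 values should match each other and the h2 value should differ.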
