How to avoid null characters when scraping a web page with getURL() in R?

Question

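I'm having trouble scraping a website with RCurl's getURL() function. For example, for http://dogecoin.com it returns an error saying there is a NULL character in the middle of the string (literal translation).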

> x <- getURL("http://dogecoin.com/")
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
  caractère nul au milieu de la chaîne : '7\x8b\b[=10=][=10=][=10=][=10=][=10=][=10=][=10=]3\xed]\xebr\xdbF\x96\xfe5=E\x9b\xa9\x89\xe4]\x82\xd4͗8\x92R\xb2|\x9d\x91-\x97\xa5\xac7\x95IiA[=10=]2$a\x81[=10=]\x82\x8bhf2ﰯ\xb1\xaf\xb1\xfbb\xfb}\xa73WB2E{\xa625\xa5*2I\xf4\xbdO\x9f\xcbw\xcei\xec\xdd{vrt\xf6\xe3\xbb\xe7j\x92N\xfd\x83\xb5\xbd\xfc7\xd7v6\xd64\xfeۻgY\xf2\xc1\xfcw~^\xf9\xb2\xf0\xf1\xdc4=\xc77}\xd5\xc7_\xa5\xf8Y\xff\x91\xbf\xf2\x9fR3\xe7\xf7\xf1\xaf.\xde\xc7[=10=]3\xa5\xe4\xef_\xe5\xef7\xe1\xaf\xfe\x88V\xf4\xaf\xfa_)φ\xf17\xed/JI7ů|\xae\xfaR\xfe\xaf\xe7\xe7\xdd\xf3>\xfe\xe21?+\xf9\\xfe9Gs\xbabu\xa2ғ\xd4Y\x93\x9f\x93PM\xed\xf8"\x8bz\xeaҍ\xe7j\xe662/u{j6\xcezR²6\xd6־₩7nj\xcb\xf7\xaf\xf6R/\xf5݃g\xe1\xd85\x86^\xb0\xd7\xd7\xdf\xf1`\xca2É5'n\xba\xdf\xc9ґ\xf5\xb8\xc3\n\xfa\xf70H\xdd[=10=]\xbf\xe75\x95\x97(;Pa\xe4[=10=]60J67]5\xb9nl\xa5\xa1\xc57\x95㍽\xd4\xf6\xd50\x8bc70λjd_\x86\xb1\xeb\xa8\xc1\\x9dN\xbc\x81\xad^\aY\x82\xd1

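On rare occasions it does return clean HTML, but most of the time I get this error. It seems to be related to their site: as you can see, there are several odd characters such as ₩ and 4͗.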

One option is to download the raw data with getURLContent(), but then I can't convert the binary content to HTML.

I tried changing the .encoding argument, but it did not give the expected result. How can I scrape this web page?

Edit: verbose mode

> getURL("http://dogecoin.com/", verbose = TRUE)
*   Trying 192.30.252.153...
* Connected to dogecoin.com (192.30.252.153) port 80 (#0)
> GET / HTTP/1.1
Host: dogecoin.com
Accept: */*

< HTTP/1.1 200 OK
< Server: GitHub.com
< Date: Wed, 25 Oct 2017 10:12:26 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Last-Modified: Tue, 16 May 2017 01:27:52 GMT
< Access-Control-Allow-Origin: *
< Expires: Wed, 25 Oct 2017 10:05:08 GMT
< Cache-Control: max-age=600
< Content-Encoding: gzip
< X-GitHub-Request-Id: A4D0:66A8:93356A1:D740FF7:59F0638A
< 
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
  caractère nul au milieu de la chaîne : '7\x8b\b[=11=][=11=][=11=][=11=][=11=][=11=][=11=]3\xed]\xebr\xdbF\x96\xfe5=E\x9b\xa9\x89\xe4]\x82\xd4͗8\x92R\xb2|\x9d\x91-\x97\xa5\xac7\x95IiA[=11=]2$a\x81[=11=]\x82\x8bhf2ﰯ\xb1\xaf\xb1\xfbb\xfb}\xa73WB2E{\xa625\xa5*2I\xf4\xbdO\x9f\xcbw\xcei\xec\xdd{vrt\xf6\xe3\xbb\xe7j\x92N\xfd\x83\xb5\xbd\xfc7\xd7v6\xd64\xfeۻgY\xf2\xc1\xfcw~^\xf9\xb2\xf0\xf1\xdc4=\xc77}\xd5\xc7_\xa5\xf8Y\xff\x91\xbf\xf2\x9fR3\xe7\xf7\xf1\xaf.\xde\xc7[=11=]3\xa5\xe4\xef_\xe5\xef7\xe1\xaf\xfe\x88V\xf4\xaf\xfa_)φ\xf17\xed/JI7ů|\xae\xfaR\xfe\xaf\xe7\xe7\xdd\xf3>\xfe\xe21?+\xf9\\xfe9Gs\xbabu\xa2ғ\xd4Y\x93\x9f\x93PM\xed\xf8"\x8bz\xeaҍ\xe7j\xe662/u{j6\xcezR²6\xd6־₩7nj\xcb\xf7\xaf\xf6R/\xf5݃g\xe1\xd85\x86^\xb0\xd7\xd7\xdf\xf1`\xca2É5'n\xba\xdf\xc9ґ\xf5\xb8\xc3\n\xfa\xf70H\xdd[=11=]\xbf\xe75\x95\x97(;Pa\xe4[=11=]60J67]5\xb9nl\xa5\xa1\xc57\x95㍽\xd4\xf6\xd50\x8bc70λjd_\x86\xb1\xeb\xa8\xc1\\x9dN\xbc\x81\xad^\aY\x82\xd1
>

Answer 1

RCurl::getURL() does not seem to detect the Content-Encoding: gzip header, nor the tell-tale two-byte "magic" code at the start of the body that also indicates the content is gzip-encoded.
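To make that diagnosis concrete, here is a minimal offline sketch in base R. It builds a gzip-compressed body locally (a stand-in for what the server sends; no network access is involved), checks for the gzip magic bytes 0x1f 0x8b, and decompresses the bytes back to text. The helper names is_gzip() and gunzip_raw() and the sample HTML string are made up for illustration; they are not part of RCurl or httr.

```r
# Stand-in for a gzip-encoded response body (hypothetical sample HTML).
html <- "<html><body>such page, much wow</body></html>"
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, open = "wb")   # gzfile() writes real gzip format
writeLines(html, con)
close(con)
body_raw <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Hypothetical helper: TRUE if a raw vector starts with the gzip magic bytes.
is_gzip <- function(x) {
  length(x) >= 2 && identical(x[1:2], as.raw(c(0x1f, 0x8b)))
}

# Hypothetical helper: decompress gzip bytes held in memory to one string.
gunzip_raw <- function(x) {
  con <- gzcon(rawConnection(x, open = "rb"))
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

if (is_gzip(body_raw)) {
  cat(gunzip_raw(body_raw), "\n")
}
```

The compressed bytes contain embedded zero bytes, which is why forcing them into an R character string fails with the "null character" error. With RCurl, the raw bytes could in principle come from getURLContent(url, binary = TRUE), but I have not verified that against this site.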

I would suggest, as Michael did, switching to httr, for reasons I'll go into in a bit, but this would be the more idiomatic httr approach:

library(httr)

res <- GET("http://dogecoin.com/")
content(res)

The content() function extracts the raw response and returns an xml2 object, which is similar to the XML library parsed object you were likely using, given the use of RCurl::getURL().
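If you would rather have the HTML as a single character string (closer to what getURL() returns) than a parsed document, content() also accepts as = "text". A small sketch, assuming httr is installed; fetch_html() is a hypothetical wrapper name, not an httr function.

```r
# Hypothetical wrapper: fetch a page and return its body as one string.
fetch_html <- function(url) {
  res <- httr::GET(url)
  httr::stop_for_status(res)   # turn HTTP errors (4xx/5xx) into R errors
  httr::content(res, as = "text", encoding = "UTF-8")
}

# Usage (needs network access):
# html <- fetch_html("http://dogecoin.com/")
# substr(html, 1, 200)
```

Passing encoding = "UTF-8" matches the charset the server advertises in the verbose output above; for other sites that value is an assumption you would need to check.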

The other way to go about it is to add some crutches to RCurl::getURL():

html_text_res <- RCurl::getURL("http://dogecoin.com/", encoding="gzip")

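Here, we're explicitly informing getURL() that the content is gzip-compressed, but that's fraught with peril: if the upstream server decides to use, say, brotli encoding, you'll get an error.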

If you still want to use RCurl rather than switch to httr, I'd suggest doing the following for this site:

RCurl::getURL("http://dogecoin.com/", 
              encoding = "gzip",
              httpheader = c(`Accept-Encoding` = "gzip"))

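This gives getURL() the decoding crutch, but also explicitly tells the upstream server that gzip is OK and that it should send data using that encoding.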

However, httr would be the better choice, since it and the curl package it uses deal with web server interactions and content in a more thorough way.
