read_html() cannot read a webpage despite specifying different encodings

Question

I am trying to read a Japanese HTML page into R using the read_html() function from the rvest/xml2 packages.

library(rvest)

url <- "https://www.post.japanpost.jp/kitte_hagaki/stamp/kogata/index.php?p=4"
read_html(url)

However, this line keeps throwing an error.

> read_html(url)
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  input conversion failed due to input error, bytes 0xAD 0xA1 0xCD 0xB9 [6003]

The guess_encoding() function returns the list of candidate encodings below, while the page's HTML header declares charset=euc-jp.

     encoding language confidence
1  ISO-8859-1       sv       0.31
2  ISO-8859-2       cs       0.22
3       UTF-8                0.15
4  ISO-8859-9       tr       0.13
5    UTF-16BE                0.10
6    UTF-16LE                0.10
7   Shift_JIS       ja       0.10
8     GB18030       zh       0.10
9      EUC-JP       ja       0.10
10     EUC-KR       ko       0.10
11       Big5       zh       0.10

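Reading the page with other encodings, for example EUC-JP, produces the same error. Specifying ISO-8859-1 does not return an error, but that encoding is simply wrong and the characters come out garbled.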

> read_html(url, encoding = "EUC-JP")
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  input conversion failed due to input error, bytes 0xAD 0xA1 0xCD 0xB9 [6003]
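A minimal sketch of why ISO-8859-1 reads without error yet yields garbage: every byte from 0xA0 to 0xFF is a valid ISO-8859-1 character, so any EUC-JP byte pair decodes "successfully" into two unrelated Latin letters. The four bytes below are an illustrative sample (the EUC-JP encoding of 日本), not taken from the page itself, and the variable names are made up for the example.

```r
# Illustrative sample: the EUC-JP byte sequence for the two characters "日本".
euc_bytes <- as.raw(c(0xC6, 0xFC, 0xCB, 0xDC))
txt <- rawToChar(euc_bytes)

# Decoded with the right encoding, the four bytes become two characters.
as_eucjp <- iconv(txt, from = "EUC-JP", to = "UTF-8")

# Decoded as ISO-8859-1, each byte becomes its own Latin-1 character
# (here "ÆüËÜ"): no error is raised, but the text is wrong.
as_latin1 <- iconv(txt, from = "ISO-8859-1", to = "UTF-8")

nchar(as_eucjp)   # 2
nchar(as_latin1)  # 4
```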

So how can I read this HTML page correctly? Thanks.

Answer 1

I'm not sure why declaring the encoding as "EUC-JP" doesn't work, but here is a roundabout way to read the URL:

library(rvest)
url <- "https://www.post.japanpost.jp/kitte_hagaki/stamp/kogata/index.php?p=4"
# Read the page as lines of text, without any encoding conversion yet
bytes <- readLines(url)
#> Warning in readLines(url): incomplete final line found on 'https://
#> www.post.japanpost.jp/kitte_hagaki/stamp/kogata/index.php?p=4'
# Convert each line from EUC-JP to UTF-8
utf8 <- iconv(bytes, from="EUC-JP", to="UTF-8")
# Rejoin the lines and parse the result as UTF-8 HTML
html <- read_html(charToRaw(paste(utf8, collapse="\n")), encoding="UTF-8")
html
#> {html_document}
#> <html lang="ja">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body onload="SetInfo (0)">\n\n<div id="wrap-outer">\n<div id="wrap-inner ...

Created on 2021-08-21 by the reprex package (v2.0.0)
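
One possible reason the direct EUC-JP read fails is that the page contains byte sequences (such as the 0xAD 0xA1 reported in the error) that the converter does not treat as valid EUC-JP; that is a guess, not something confirmed here. Under that assumption, a variant of the approach above is to convert the bytes with iconv()'s sub argument, which replaces non-convertible bytes instead of giving up on the input. to_utf8_lossy is a hypothetical helper name, and the sample bytes are made up for illustration.

```r
# Hypothetical helper: convert raw bytes in a given encoding to a UTF-8 string,
# replacing any bytes iconv() cannot convert with `sub` instead of returning NA.
# Assumes the input has no embedded NUL bytes (rawToChar() would reject them).
to_utf8_lossy <- function(raw_bytes, from = "EUC-JP", sub = "?") {
  iconv(rawToChar(raw_bytes), from = from, to = "UTF-8", sub = sub)
}

# Made-up sample: "<p>", the EUC-JP bytes for "日本", then "</p>".
sample_raw <- c(charToRaw("<p>"),
                as.raw(c(0xC6, 0xFC, 0xCB, 0xDC)),
                charToRaw("</p>"))
converted <- to_utf8_lossy(sample_raw)

# 0xFF is never valid in EUC-JP; with `sub` set the result is still a string.
with_bad_byte <- to_utf8_lossy(as.raw(c(0x61, 0xFF, 0x62)))
```

The resulting string could then be passed to read_html(charToRaw(converted), encoding = "UTF-8"), as in the code above. Characters replaced by sub are lost, so this only suits cases where a few unreadable characters are acceptable.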
