How can I scrape content from a website that's in Simplified Chinese?

Question

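I have tested this code on a variety of English-language websites without any problems. However, when I try to scrape content from a Simplified Chinese website, the data shows up as garbled characters in the CSV file. In addition, the article body is spread across multiple rows in Excel instead of being contained in a single cell. Can anyone help?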

install.packages("rvest")
library(rvest)

###specifying the URL for the website you want to scrape ###
url <-'https://new.qq.com/omn/20190823/20190823A02W4Q00.html'

##reading the HTML code from the website
webpage <- read_html(url)

###using CSS selectors to scrape the title
title_html <- html_nodes(webpage,'h1')

###Converting the title data to text
title_data <- html_text(title_html)

###using CSS selectors to scrape the body
text_html <- html_nodes(webpage,'.one-p')

###Converting the body data to text
text_data <- html_text(text_html)


d <- data.frame(text_data)
write.csv(d,"chinesetext.csv")

Answer 1

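Most of these problems are caused by encoding. I tried the guess_encoding function, and it guessed UTF-8, but parsing the page as UTF-8 does not work. Here is the error: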

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
input conversion failed due to input error, bytes 0xC8 0xDD 0x2D 0x2D [6003]

So I switched to the Extended Unix Code encoding for Simplified Chinese (EUC-CN) instead. That worked.

url <-'https://new.qq.com/omn/20190823/20190823A02W4Q00.html'
webpage <- read_html(url, encoding="euc-cn")
title_html <- html_nodes(webpage,'h1')
title_data <- html_text(title_html)
title_data
[1] "“六稳”政策显效 抗压能力增强"
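
As a rough offline illustration of why the UTF-8 parse fails and why naming the encoding helps, the base R sketch below encodes a short Chinese string as EUC-CN bytes, checks that those bytes are not valid UTF-8, and then decodes them again. The sample string and the variable names are illustrative assumptions, not content taken from the page, and whether iconv accepts the name "EUC-CN" depends on the platform's iconv implementation.

```r
# Hypothetical sample text, written with \u escapes so the script itself
# stays ASCII: "\u5185\u5bb9" is the two-character word for "content".
utf8_text <- "\u5185\u5bb9"

# Encode the string as EUC-CN; toRaw = TRUE returns a list of raw vectors.
gb_bytes <- iconv(utf8_text, from = "UTF-8", to = "EUC-CN", toRaw = TRUE)[[1]]
print(gb_bytes)

# Treated as UTF-8, these bytes are malformed, which is the kind of input
# that makes a UTF-8 parse stop with a conversion error.
gb_string <- rawToChar(gb_bytes)
print(validUTF8(gb_string))

# Decoding with the encoding the bytes were actually written in
# recovers the original text.
decoded <- iconv(gb_string, from = "EUC-CN", to = "UTF-8")
print(identical(decoded, utf8_text))
```

This mirrors what passing encoding="euc-cn" to read_html does for the whole page: the parser is told which encoding the raw bytes are in rather than being left to assume UTF-8.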

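You may also want the data frame to display in Chinese. Add this line before your code, and you will then be able to see the Chinese text in the global environment.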

Sys.setlocale("LC_ALL", "Chinese")
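
The second part of the question, the body being spread across several rows, is not covered above. One possible approach, sketched here under assumptions, is to collapse the paragraph vector into a single string before building the data frame, since html_text returns one element per matched node and data.frame puts each element in its own row. The sample paragraphs, the column names, and the temporary output path are hypothetical stand-ins; the separator is a matter of taste.

```r
# Hypothetical stand-ins for the vectors returned by html_text();
# the \u escapes spell short Chinese placeholder strings.
title_data <- "\u6807\u9898"
text_data  <- c("\u7b2c\u4e00\u6bb5", "\u7b2c\u4e8c\u6bb5")

# Join all paragraphs into one string so the body fills a single cell.
body <- paste(text_data, collapse = " ")

d <- data.frame(title = title_data, body = body, stringsAsFactors = FALSE)

# Write the file as UTF-8 and drop the row-name column.
out_file <- file.path(tempdir(), "chinesetext.csv")
write.csv(d, out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back to confirm there is one row with two columns.
check <- read.csv(out_file, encoding = "UTF-8", stringsAsFactors = FALSE)
print(dim(check))
```

Excel may still need to be told that the file is UTF-8 when importing it, and the locale name "Chinese" used in Sys.setlocale above is a Windows-style name, so other platforms would likely need a different locale string.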

r

cjk

rvest