Scraping Chinese characters with Python

Question

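I learned how to scrape websites from https://automatetheboringstuff.com. I wanted to scrape http://www.piaotian.net/html/3/3028/1473227.html, whose content is in Chinese, and write that content to a .txt file. However, the .txt file contains random symbols, which I assume is an encoding/decoding problem.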

I have read this thread “” and concluded that the encodings used by my website are "gb2312" and "windows-1252". I tried decoding with both of them, but it failed.

Can someone explain what is wrong with my code? I am very new to programming, so please point out anything I have misunderstood!

Also, when I remove "html.parser" from the code, the .txt file ends up empty instead of at least containing symbols. Why is that?

import bs4, requests, sys

reload(sys)
sys.setdefaultencoding("utf-8")

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()

novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")

content = novelSoup.select("br")

novelFile = open("novel.txt", "w")
for i in range(len(content)):
    novelFile.write(str(content[i].getText()))

Answer 1

import bs4, requests

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
# Tell Requests the page is GBK-encoded before reading novel.text
novel.encoding = "GBK"
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")

Output:

<br>
    一元宗，坐落在青峰山上，绵延极长，现在是盛夏时节，天空之中，太阳慢慢落了下去，夕阳将影子拉的很长。<br/>
<br/>
    一片不是很大的小湖泊边上，一个约莫着十七八岁的青衣少年坐在湖边，抓起湖边的一块石头扔出，顿时在湖边打出几朵浪花。<br/>
<br/>
    叶希文有些茫然，他没想到，他居然穿越了，原本叶希文只是二十一世纪的地球上一个普通的大学生罢了，一个月了，他才后知后觉的反应过来，这不是有人和他进行恶作剧，而是，他真的穿越了。<br/>
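The question also asks about writing the text to a .txt file, which the snippet above does not cover. The following is a minimal sketch rather than a definitive fix, assuming Python 3. The helper name save_text and the temporary output path are hypothetical, and a short sample string stands in for novel.text so the example runs without network access. The idea is to open the output file with an explicit encoding, so the result does not depend on the platform default.

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write already-decoded text to `path` as UTF-8 (hypothetical helper)."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Stand-in for the raw bytes a GBK-encoded page would return (assumption).
raw = u"一元宗，坐落在青峰山上".encode("gbk")

# Equivalent to reading novel.text after setting novel.encoding = "GBK".
text = raw.decode("gbk")

# Write to a temporary directory so the sketch leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), "novel.txt")
save_text(out_path, text)

# Read the file back to confirm the characters survived the round trip.
with io.open(out_path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

With the real page, text would be replaced by novel.text or by the text extracted through BeautifulSoup.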

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property:
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
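To illustrate why the encoding matters, here is a small sketch, assuming Python 3, that decodes the same GBK bytes with two different encodings. The function decode_body is a hypothetical stand-in for what happens when r.text is read: the raw body is decoded with whatever r.encoding holds. ISO-8859-1 is used for the wrong guess because Requests falls back to it for text responses whose headers name no charset; whether that is what happened for this particular site is an assumption.

```python
def decode_body(raw, encoding):
    """Decode raw response bytes with the given encoding (hypothetical stand-in for r.text)."""
    return raw.decode(encoding, errors="replace")


sample = u"夕阳将影子拉的很长"

# Bytes as a GBK-encoded server would send them: two bytes per character here.
raw = sample.encode("gbk")

# Wrong guess: every byte becomes its own Latin-1 character, giving random-looking symbols.
wrong = decode_body(raw, "iso-8859-1")

# Correct encoding: the original Chinese text is recovered.
right = decode_body(raw, "gbk")
```

Here right equals the original string, while wrong has one character per byte and none of them is Chinese, which matches the symptom described in the question.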
