Text from website appears as gibberish instead of Hebrew

Question

I'm trying to fetch a string from a website. I use the requests module to send a GET request.

import requests
text = requests.get("http://example.com")  # send a GET request to the website
print text.text  # print the decoded response body

However, for some reason, the text appears as gibberish instead of Hebrew:

<div>
<p>×©×¨×ª</p>
</div>

When I sniff the traffic with Fiddler or view the website in a browser, I see it in Hebrew:

<div>
<p>שרת</p>
</div>

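By the way, the HTML contains a meta tag that defines the encoding as utf-8. I tried encoding the text as utf-8, but it is still gibberish. I tried decoding it as utf-8, but that raises a UnicodeEncodeError exception. I declared utf-8 on the first line of the script. The problem also occurs when I send the request with the built-in urllib module.

I read the Unicode HOWTO but still can't fix it. I have also read many threads here (about the UnicodeEncodeError exception and why Hebrew turns into gibberish in Python), but I still can't fix it.

I'm using Python 2.7.9 on a Windows machine, and I run my script in Python IDLE.

Thanks in advance.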

Answer 1

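The server is not declaring the encoding correctly, so the UTF-8 bytes of the page are apparently being decoded as Latin-1. Re-encoding the garbled text as Latin-1 and decoding it as UTF-8 recovers the Hebrew: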

>>> print u'×©×¨×ª'.encode('latin-1').decode('utf-8')
שרת
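
To see where the garbled characters come from, the round trip can be reproduced without any network access. This is a minimal sketch, assuming the page is served as UTF-8; the variable names are illustrative, the Hebrew word is written with \u escapes so the snippet does not depend on the source file's encoding, and it is written to run on both Python 2.7 and Python 3.

```python
# The word from the question, written with escapes: shin, resh, tav.
hebrew = u'\u05e9\u05e8\u05ea'

# The bytes a UTF-8 server actually sends over the wire.
raw_bytes = hebrew.encode('utf-8')

# Reading those bytes as Latin-1 maps each byte to one character,
# which produces the gibberish shown in the question.
garbled = raw_bytes.decode('latin-1')

# Undo the wrong decoding, then decode with the correct codec.
repaired = garbled.encode('latin-1').decode('utf-8')

print(len(hebrew), len(raw_bytes), len(garbled))  # 3 characters, 6 bytes, 6 characters
print(repaired == hebrew)  # True
```

Each Hebrew letter takes two bytes in UTF-8, which is why three letters turn into six garbled characters.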

Set text.encoding before accessing text.text.

import requests
text = requests.get("http://example.com")  # send a GET request to the website
text.encoding = 'utf-8'  # override the encoding requests guessed for the page
print text.text  # print the body, now decoded as UTF-8
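
If the same script fetches pages from several servers, hard-coding utf-8 for every response may be too blunt. The following is a hedged sketch of one possible refinement: it overrides the encoding only when the Content-Type header names no charset. The helper name force_utf8_if_undeclared and the default of utf-8 are assumptions made for illustration, not part of requests; the helper relies only on the headers and encoding attributes that a requests response provides.

```python
def force_utf8_if_undeclared(response, default='utf-8'):
    """Set response.encoding only if the server declared no charset.

    Hypothetical helper: 'response' is expected to look like a
    requests response, i.e. to have 'headers' and 'encoding'.
    """
    content_type = response.headers.get('Content-Type', '')
    if 'charset=' not in content_type.lower():
        # No charset in the header, so assume the page is UTF-8.
        response.encoding = default
    return response

# Usage with requests (not executed here, since it needs network access):
#   import requests
#   response = force_utf8_if_undeclared(requests.get("http://example.com"))
#   print(response.text)
```

A server that does declare a charset, such as windows-1255, keeps its declared encoding; only responses with no declaration fall back to the assumed default.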
