How to convert a string from cp1251 to UTF-8 in Python3?

Question

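I need help with a very simple Python 3.6 script.

First, it downloads an HTML file from an old server that uses cp1251 encoding.

Then I need to put the file contents into a UTF-8 encoded string.

Here is what I am doing: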

import requests
import codecs

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

#checking that it's in cp1251
print(ri.encoding)

#encoding using cp1251
text = ri.text
text = codecs.encode(text,'cp1251')

#decoding using utf-8 - ERROR HERE!
text = codecs.decode(text,'utf-8')

print(text)

Here is the error:

Traceback (most recent call last):
  File "main.py", line 15, in <module>
    text = codecs.decode(text,'utf-8')
  File "/var/lang/lib/python3.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 43: invalid continuation byte

Any help would be appreciated.

Answer 1

You don't need to do any encoding/decoding.

"When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text"

So this will work:

import requests

#getting the file
ri = requests.get('http://old.moluch.ru/_python_test/0.html')

text = ri.text

print(text)

For non-text requests, you can also access the response body as bytes:

ri.content

See the requests documentation.
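
If what you actually need is UTF-8 encoded bytes (for example, to write them to a file), a minimal sketch could look like this. It assumes the body really is cp1251 and uses a hard-coded byte string as a stand-in for ri.content so that it runs without network access; the helper name cp1251_to_utf8 is hypothetical.

```python
def cp1251_to_utf8(raw: bytes) -> bytes:
    """Re-encode cp1251 bytes as UTF-8 bytes."""
    text = raw.decode('cp1251')   # bytes -> str (decode with the source encoding)
    return text.encode('utf-8')   # str -> bytes (encode with the target encoding)


# Stand-in for ri.content: the word "Привет" encoded as cp1251.
sample = 'Привет'.encode('cp1251')

utf8_bytes = cp1251_to_utf8(sample)
print(utf8_bytes.decode('utf-8'))
```

The order matters: decode with the encoding the bytes are in, then encode with the one you want.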

Answer 2

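Not sure what you are trying to do.

.text is the text of the response, a Python string. Encodings play no role in Python strings.

Encodings only come into play when you have a stream of bytes that you want to convert to a string (or the other way around). The requests module has already done that for you.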

import requests

ri = requests.get('http://old.moluch.ru/_python_test/0.html')
print(ri.text)

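For example, suppose you have a text file (i.e. bytes). When you open() the file, you must pick an encoding; that choice determines how the bytes in the file are converted into characters. This manual step is necessary because open() cannot know what encoding the file's bytes are in.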
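
A small sketch of the open() case, assuming a legacy file that is known to be cp1251. The file names are made up for illustration, and the file is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Create a small cp1251-encoded file to stand in for a legacy text file.
path = os.path.join(tempfile.mkdtemp(), 'legacy.txt')
with open(path, 'w', encoding='cp1251') as f:
    f.write('Привет')

# Reading with the matching encoding turns the bytes back into characters.
with open(path, encoding='cp1251') as f:
    text = f.read()

# Writing the same str with encoding='utf-8' produces a UTF-8 file.
utf8_path = path + '.utf8'
with open(utf8_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Inspect the raw bytes of the new file.
with open(utf8_path, 'rb') as f:
    utf8_bytes = f.read()
```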

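HTTP, on the other hand, sends this information in the response headers (Content-Type), so requests can know it. Being a high-level module, it helpfully looks at the HTTP headers and converts the incoming bytes for you. (If you were to use the more low-level urllib, you would have to do the decoding yourself.)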
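
To illustrate the idea of decoding by hand from a Content-Type header, here is a rough sketch. It is not how requests works internally; the helper decode_body and its 'utf-8' fallback are assumptions made for this example, and the header value and body are hard-coded so the code runs offline.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = 'utf-8') -> str:
    """Decode a response body using the charset named in a Content-Type header."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the charset parameter, or None if absent.
    charset = msg.get_content_charset() or default
    return body.decode(charset)


# Stand-ins for what a low-level HTTP client would hand you.
body = 'Привет'.encode('cp1251')
text = decode_body(body, 'text/html; charset=windows-1251')
print(text)
```

Python treats windows-1251 as an alias of cp1251, so the charset name from the header can be passed to decode() directly.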

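When you use the response's .text, the .encoding attribute simply reports which encoding requests used to decode the body (you can assign to it if the guess is wrong). It may be more relevant if you use the .raw attribute, though. For servers that return regular text responses, using .raw is rarely necessary.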

Answer 3

You can simply ignore the errors by passing an extra setting to the decode function:

text = codecs.decode(text,'utf-8',errors='ignore')
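
Be aware that errors='ignore' silences the exception by dropping every byte that is not valid UTF-8, so the Cyrillic characters are lost rather than converted. A quick sketch with a made-up sample string:

```python
import codecs

# cp1251 bytes for a string that mixes Cyrillic and ASCII characters.
raw = 'Привет, world'.encode('cp1251')

# Decoding as UTF-8 with errors='ignore' discards the undecodable bytes.
lossy = codecs.decode(raw, 'utf-8', errors='ignore')

# Decoding with the encoding the bytes are actually in keeps everything.
correct = codecs.decode(raw, 'cp1251')

print(repr(lossy))
print(repr(correct))
```

Only the ASCII part survives in lossy, because none of the cp1251 bytes for Cyrillic letters form valid UTF-8 sequences here.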

Answer 4

Many people have already answered that requests.get gives you text that is already decoded. I will address the error you are facing now.

This line:

text = codecs.encode(text,'cp1251')

encodes the text as cp1251, and then you try to decode it using utf-8, which raises the error here:

text = codecs.decode(text,'utf-8')

To detect the encoding, you can use:

import codecs
import chardet

# text is the str taken from ri.text
encoded = codecs.encode(text, 'cp1251')
chardet.detect(encoded)  # output: {'encoding': 'windows-1251', 'confidence': 0.99, 'language': 'Russian'}

# OR
encoded = codecs.encode(text, 'utf-8')
chardet.detect(encoded)  # output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

So encoding in one format and then decoding in another causes the error.
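
The mismatch can be reproduced offline with a short sketch. The sample string is made up; the point is only that the codec used to decode must match the one used to encode.

```python
text = 'Привет'

# str -> bytes using cp1251
encoded = text.encode('cp1251')

# Decoding those bytes with a different codec fails...
error = None
try:
    encoded.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc

# ...while decoding with the same codec round-trips cleanly.
roundtrip = encoded.decode('cp1251')

print(type(error).__name__)
print(roundtrip == text)
```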

