Python script chokes on a downloaded file because of unicode encode error

Question

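Four times a day I run a script I wrote that uses the requests module to download a file and then loads it into a database. Nine times out of ten the script works perfectly, but occasionally it fails because the downloaded file contains characters my script doesn't like. For example, this is the error I got today: UnicodeEncodeError: 'ascii' codec can't encode characters in position 379-381: ordinal not in range(128). I downloaded the file another way, and this is the character at position 380 that I believe is stopping my script: "∞". And this is the place in my script where it chokes: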

##### request file

r = requests.get('https://resources.example.com/requested_file.csv')

##### create the database importable csv file

ld = open('/requested_file.csv', 'w')
print(r.text, file=ld)

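I know this probably has something to do with encoding the text somehow before printing it to the .csv file, and it is probably simple for someone who knows what they're doing, but after many hours of research I'm about to cry. Thanks in advance for your help!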

Answer 1

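You need to provide an encoding for your file; right now it is defaulting to ASCII, which is a very limited codec. (When no encoding argument is given, open() falls back to the locale's preferred encoding, which in the environment your script runs in is evidently ASCII.)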

You could use UTF-8 instead, for example:

with open('/requested_file.csv', 'w', encoding='utf8') as ld:
    print(r.text, file=ld)
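
To see the difference without hitting the network, here is a minimal sketch. The sample row and the scratch file path are made up for illustration; only the "∞" character comes from the question. It shows that encoding the text as ASCII raises the same kind of error, while writing through a file opened with encoding='utf8' goes through:

```python
import os
import tempfile

text = "value,∞\n"  # made-up sample row containing the character from the question

# Encoding to ASCII reproduces the failure reported in the question.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With an explicit UTF-8 encoding the write no longer depends on the locale.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")  # hypothetical scratch file
with open(path, "w", encoding="utf8") as ld:
    print(text, end="", file=ld)

# Read the raw bytes back to confirm the character was stored as UTF-8.
with open(path, "rb") as f:
    written = f.read()
```

After running this, ascii_ok is False and written contains the UTF-8 bytes for "∞".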

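However, since you are loading the data from a URL, you are now decoding it and then encoding it again. A better idea is to copy the data straight to disk as bytes. Make a streaming request and let shutil.copyfileobj() copy the data in chunks. That way you can handle a response of any size without loading it all into memory: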

import requests
import shutil

r = requests.get('https://resources.example.com/requested_file.csv', stream=True)
with open('/requested_file.csv', 'wb') as ld:
    r.raw.decode_content = True  # decompress gzip or deflate responses
    shutil.copyfileobj(r.raw, ld)
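
The reason the byte copy sidesteps the problem is that no codec is involved at all. As a rough illustration, the sketch below uses an io.BytesIO object as a stand-in for r.raw (an assumption made so the example runs offline; the payload and scratch path are also made up) and copies it in small chunks:

```python
import io
import os
import shutil
import tempfile

# Stand-in for r.raw: copyfileobj accepts any binary file-like object.
payload = "value,∞\n".encode("utf-8")
source = io.BytesIO(payload)

path = os.path.join(tempfile.mkdtemp(), "sample.csv")  # hypothetical scratch file
with open(path, "wb") as ld:
    # A tiny chunk size, only to force several read/write rounds in this demo.
    shutil.copyfileobj(source, ld, length=4)

# The bytes on disk are identical to the bytes that were "downloaded".
with open(path, "rb") as f:
    copied = f.read()
```

Because the file is opened in 'wb' mode, the bytes land on disk unchanged, whatever characters they encode.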

Answer 2

I tried a lot of different things, but this is what ultimately worked for me:

import requests
import io

##### request file

r = requests.get('https://resources.example.com/requested_file.csv')

##### create the db importable csv file

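with open('requested_file_TEMP.csv', 'wb') as ld:
    ld.write(r.text.encode())  # str.encode() defaults to UTF-8

##### run the temp file through the following code to get rid of any non-ascii characters
##### in the file; non-ascii characters can/will cause the script to choke

# The output must be a different file from the input: opening the same path
# with 'w' truncates it before it is read. 'requested_file.csv' is an assumed name.
with io.open('requested_file_TEMP.csv', 'r', encoding='utf-8', errors='ignore') as infile, \
     io.open('requested_file.csv', 'w', encoding='ascii', errors='ignore') as outfile:
    for line in infile:
        # note: split() also collapses runs of whitespace in each line to single spaces
        print(*line.split(), file=outfile)
# both files are closed automatically when the with block exits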
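
Keep in mind that errors='ignore' silently throws data away. A small sketch of what the standard error handlers do to a made-up sample line (the line itself is an assumption, not taken from the real file):

```python
line = "growth,∞,42"  # made-up sample line

# 'ignore' drops the character entirely, leaving an empty field.
dropped = line.encode("ascii", errors="ignore").decode("ascii")

# 'replace' substitutes a question mark, so the loss is at least visible.
marked = line.encode("ascii", errors="replace").decode("ascii")

# 'backslashreplace' keeps the code point as an escape sequence.
escaped = line.encode("ascii", errors="backslashreplace").decode("ascii")
```

If the database can store UTF-8, keeping the original characters is probably preferable to any of these.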

python

unicode

encode

python-requests