How to determine how a (downloaded) byte string is encoded in Python?

Question

I am trying to download a file and write it to disk, but somehow I am lost in encoding/decoding land.

from urllib.request import urlopen
url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    data = response.read()
    filename = 'test.txt'
    file_ = open(filename, 'wb')
    file_.write(data)
    file_.close()

Here data is a byte string. If I inspect the file, I see a bunch of strange characters. I tried:

import chardet
the_encoding = chardet.detect(data)['encoding']

but this results in None. So I really have no idea how the data I downloaded is encoded.

If I simply enter "http://export.arxiv.org/e-print/supr-con/9608001" in a browser, it downloads a file that I can view in a text editor, and it is a perfectly fine .tex file.

Answer 1

Use the python-magic library.

python-magic is a Python interface to the libmagic file type identification library. libmagic identifies file types by checking their headers according to a predefined list of file types. This functionality is exposed to the command line by the Unix command file.
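To illustrate the idea of header-based identification without installing libmagic, here is a minimal sketch that checks the first bytes of a buffer against a few well-known signatures. The SIGNATURES table and the sniff_type helper are hypothetical names made up for this example and cover only a handful of formats; libmagic's real database is far larger.

```python
import gzip

# Hypothetical, hand-picked signature table (NOT libmagic's database).
# Each entry maps the leading "magic" bytes of a format to a description.
SIGNATURES = [
    (b"\x1f\x8b", "gzip compressed data"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "zip archive"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def sniff_type(data: bytes) -> str:
    """Return a description of the first matching signature, or 'unknown'."""
    for magic_bytes, description in SIGNATURES:
        if data.startswith(magic_bytes):
            return description
    return "unknown"


# Usage: gzip output always begins with the two bytes 0x1f 0x8b.
sample = gzip.compress(b"\\documentclass{article}")
print(sniff_type(sample))
print(sniff_type(b"plain text"))
```

If the buffer turns out to be compressed, running chardet on it directly is not meaningful, because the bytes are not text yet.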

Commented script (tested on Windows 10, Python 3.8.6):

# stage #1: read raw data from a url
from urllib.request import urlopen
import gzip
url = "http://export.arxiv.org/e-print/supr-con/9608001"
with urlopen(url) as response:
    rawdata = response.read()

# stage #2: detect raw data type by its signature
print("file signature", rawdata[0:2])
import magic
print( magic.from_buffer(rawdata[0:1024]))

# stage #3: decompress raw data and write to a file
data = gzip.decompress(rawdata)
filename = 'test.tex'
with open(filename, 'wb') as file_:
    file_.write(data)

# stage #4: detect encoding of the data ( == encoding of the written file)
import chardet
print( chardet.detect(data))

Result: .\SO307124.py

file signature b'\x1f\x8b'
gzip compressed data, was "9608001.tex", last modified: Thu Aug  8 04:57:44 1996, max compression, from Unix
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
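The output above suggests why chardet returned None in the question: the downloaded bytes were gzip-compressed, so there was no text encoding to detect until after decompression. As a rough sketch of the same flow using only the standard library, the hypothetical helper to_text below decompresses the data if it starts with the gzip signature and then tries a short list of candidate encodings. The candidate list is an assumption for illustration, not a general-purpose detector.

```python
import gzip


def to_text(rawdata: bytes, encodings=("ascii", "utf-8", "latin-1")):
    """Decompress gzip data if needed, then decode with the first encoding that works.

    Returns a (text, encoding_name) tuple.
    """
    # gzip streams start with the two-byte signature 0x1f 0x8b
    if rawdata[:2] == b"\x1f\x8b":
        rawdata = gzip.decompress(rawdata)
    for encoding in encodings:
        try:
            return rawdata.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reachable if the caller passes a list without a catch-all such as
    # latin-1, which accepts every possible byte value.
    raise ValueError("none of the candidate encodings could decode the data")


# Usage with locally generated data (no network access needed)
compressed = gzip.compress("\\documentclass{article}".encode("ascii"))
text, encoding = to_text(compressed)
print(encoding, text)
```

Note that latin-1 never fails, so a successful decode with it does not prove the guess is right; a detector such as chardet is still the better tool once the data is known to be text.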
