doc or docx: Is there a safe way to identify the type from 'requests' in python3?

Question

1) How can I tell doc and docx files apart using requests?

a) For example, if I have

url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get:

application/vnd.openxmlformats-officedocument.wordprocessingml.document

This file is a docx.

b) If I have

url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])

I get:

application/msword

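This file is a doc.

2) Are there any other options?

3) If I save a docx file as doc, or vice versa, will I run into identification problems (for example, when converting to pdf)? Is there a best practice for dealing with this?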

Answer 1

The MIME headers you get appear to be the correct ones: What is a correct mime type for docx, pptx etc?
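As a rough sketch, the two header values from the question can be mapped to a type with a small lookup. The helper name classify_content_type is hypothetical, and the function assumes the server may append parameters such as a charset to the header value.

```python
# Map the two MIME types shown in the question to a short label.
WORD_MIME_TYPES = {
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
}


def classify_content_type(header_value):
    """Return 'doc', 'docx', or None for a Content-Type header value.

    classify_content_type is a hypothetical helper, not part of requests.
    """
    if not header_value:
        return None
    # Drop parameters such as '; charset=...' and normalise the case.
    mime = header_value.split(";", 1)[0].strip().lower()
    return WORD_MIME_TYPES.get(mime)
```

With the response from the question, this could be called as classify_content_type(r.headers.get('content-type')). It only reports what the server claims, so a None result or a wrong label is still possible.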

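However, the sending software can only send what the user selected, and there are still lots of people who send files with the wrong extension. Some software can handle this, some cannot. To see this in action, rename a PNG image so that it ends in JPEG. I just did so on my Mac, and Preview is still able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it is correctly identified as a "Portable Network Graphics" file. (Your OS may or may not do the same.)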

But after downloading the file, you can unambiguously distinguish between a DOC and a DOCX file, even if the author got the extension wrong.

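A DOC file starts with a Microsoft OLE header, which is a fairly complex structure. A DOCX file, on the other hand, is a compound file format containing lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.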
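A minimal sketch of checking both signatures on raw bytes, rather than only PK, might look like this. It assumes the commonly documented 8-byte OLE2 compound file signature and the 4-byte ZIP local file header; sniff_container is a hypothetical name.

```python
# Signatures assumed here: the OLE2 compound file header and the ZIP
# local file header that begins with the characters 'PK'.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data):
    """Return 'zip', 'ole', or 'unknown' based on the leading bytes.

    sniff_container is a hypothetical helper. It identifies the container
    format only, not whether the content is really a Word document.
    """
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(OLE_MAGIC):
        return "ole"
    return "unknown"
```

With the response from the question, sniff_container(r.content[:8]) would be enough, since only the first eight bytes are inspected.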

This check is compatible with both Python 2.7 and 3.x (only one of them needs the decode):

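import sys

# Usage: python script.py <path-to-file>
if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    # Open in binary mode so the first bytes are read unmodified.
    with open(sys.argv[1], 'rb') as testMe:
        # latin1 maps every byte to one character, so this works on 2.7 and 3.x.
        startBytes = testMe.read(2).decode('latin1')
        print(startBytes)
        if startBytes == 'PK':
            print('This is a DOCX document')
        else:
            print('This is a DOC document')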

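Technically, it will confidently state "This is a DOC document" for anything that does not start with PK and, conversely, it will state "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you base further processing of the file on this decision, you may find out that it is not a Microsoft Word document after all. But at least you will have tried it with the right decoder.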
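One way to narrow the false positives for the ZIP case is to open the archive and look for the main document part. The sketch below assumes the part is stored at its usual location, word/document.xml, which is typical for files written by Word but is an assumption here; looks_like_docx is a hypothetical name.

```python
import io
import zipfile


def looks_like_docx(data):
    """Return True if data is a readable ZIP containing word/document.xml.

    looks_like_docx is a hypothetical helper. The member name checked is
    the usual location of the main document part, assumed for this sketch.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Not a ZIP archive at all, e.g. plain text that starts with 'PK'.
        return False
```

Called as looks_like_docx(r.content), this rejects ordinary ZIP archives and text files that merely start with PK, at the cost of reading the whole download into memory.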


ms-word

doc

docx

python-requests