How to get XML from DOC (not DOCX)?

For DOCX documents, I do this:

import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')

How can I do the same for a DOC document?

You don't.

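DOCX files are hard enough to deal with, and they are XML-based and documented by an international standards organization. DOC files are binary and proprietary.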

Don't try to process DOC files directly. Convert them to DOCX first.
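
One possible way to do the conversion, sketched here under the assumption that LibreOffice is installed and its `soffice` executable is on the PATH (the answer itself does not name a tool). The helper names `build_convert_command`, `doc_to_docx` and `read_document_xml` are made up for this illustration:

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to DOCX."""
    return [
        soffice,
        "--headless",              # run without opening a window
        "--convert-to", "docx",    # target format
        "--outdir", str(out_dir),  # directory for the converted file
        str(doc_path),
    ]


def doc_to_docx(doc_path, out_dir):
    """Convert a .doc file to .docx and return the path of the new file.

    Assumption: LibreOffice is installed; other converters would work too.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")
```

The bytes returned by `read_document_xml(doc_to_docx(path, out_dir))` can then be handed to BeautifulSoup exactly as in the question. On a machine with Microsoft Word installed, automating Word itself is another route, which the questions linked below discuss.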

See also:

Automation: how to automate transforming .doc to .docx?
multiple .doc to .docx file conversion using python
Python & MS Word: Convert .doc to .docx?