Iterate over pathlib paths and python-docx: zipfile.BadZipFile

Question

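My Python skills are a bit rusty since I have mostly been using R lately, and I have run into the following problem. My goal is to iterate recursively over all .docx files in a directory and change some core properties with the python-docx package.

For the loop, I first created a list with pathlib and glob: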

from docx import Document
from docx.shared import Inches
import pathlib

# Reading the stats dir
root_dir = pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files

The output of files looks fine:

[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
 WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]

When I now try to read a document from the list, I get a zip error (see the full traceback below):

document = Document(files[1])
Traceback (most recent call last):
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-26-482c5438fa33>", line 1, in <module>
    document = Document(files[1])
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
    document_part = Package.open(docx).main_document_part
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
    self._zipf = ZipFile(pkg_file, 'r')
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

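However, running the same line of code without the list works fine (apart from the difference between the / and \ path separators, which I assume does not matter because the list contains pathlib.Path objects):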

document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
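
As a quick check of the assumption that the separator style is irrelevant, the sketch below compares both spellings of the path from the question. It uses PureWindowsPath (my choice, so that it behaves the same on any operating system) and never touches the file system.

```python
from pathlib import PureWindowsPath

# The same location written with backslashes and with forward slashes.
with_backslashes = PureWindowsPath(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx")
with_slashes = PureWindowsPath("C:/Users/Björn/PycharmProjects/mre_docx/test1.docx")

# pathlib normalises the separators, so the two objects compare equal.
same_path = with_backslashes == with_slashes
print(same_path)
```

If the two compare equal, the separators are unlikely to be the cause. Note that the call that works reads test1.docx while the failing call reads files[1], which is test2.docx.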

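Edit after comments

I created four new Word files in total for this MRE. I typed text into two of them and left the other two empty. To my surprise, the empty ones are the ones that cause the error.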

for file in files:
    try:
        document = Document(file)
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl+C still works
        print(f"The file: {file} appears to be corrupted")

Output:

The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
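
One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the "empty" files are zero-byte placeholders and not real Word packages. A .docx file is a zip archive, so a file without any bytes cannot pass the zip check. The sketch below reproduces the same exception with the standard library only; the file name is a hypothetical stand-in.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical zero-byte file standing in for an "empty" Word document.
    empty_docx = Path(tmp) / "new_file.docx"
    empty_docx.touch()

    # A zero-byte file has no zip structure at all.
    looks_like_zip = zipfile.is_zipfile(empty_docx)

    # Opening it with ZipFile raises the exception type seen in the traceback.
    try:
        with zipfile.ZipFile(empty_docx, "r"):
            error_message = None
    except zipfile.BadZipFile as exc:
        error_message = str(exc)

print(looks_like_zip, error_message)
```

Checking the size of the failing files (for example with file.stat().st_size) would show whether this assumption applies to them.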

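Semi-solution for future readers

Wrap the call to Document("Path/to/file.docx") in a try/except block and print the file for which the call fails. In my case there were only a few, which I could easily fix by hand.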
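
As a variation on this workaround, the files could be sorted up front instead of waiting for the exception. This is only a sketch under the assumption that a failed zip check is what makes a file unreadable; split_by_zip_check is a hypothetical helper and not part of python-docx.

```python
import zipfile
from pathlib import Path


def split_by_zip_check(root_dir):
    """Return (readable, unreadable) lists of .docx paths below root_dir."""
    readable, unreadable = [], []
    # rglob("*.docx") searches root_dir and all of its subdirectories.
    for path in sorted(Path(root_dir).rglob("*.docx")):
        if not path.is_file():
            continue
        # A valid .docx is a zip archive; zipfile rejects everything else.
        if zipfile.is_zipfile(path):
            readable.append(path)
        else:
            unreadable.append(path)
    return readable, unreadable
```

Only the paths in the first list would then be passed to Document(...), and the second list can be printed for manual repair. Passing the zip check does not guarantee a valid Word document, so keeping the try/except around the Document call is still reasonable.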

Answer 1

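You did nothing wrong. You get this error because the documents are empty; if you open those files and type something into them, the error goes away. According to https://python-docx.readthedocs.io/en/latest/user/documents.html

there are different ways to open a Word document.

The first creates a new blank document and saves it to the given path: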

document = Document()
document.save(files[1])

The second opens an existing document and saves it again:

document = Document(files[1])
document.save(files[1])

In addition, according to the documentation, you can open them like a file:

with open(files[1], 'rb') as f:
    document = Document(f)
