Python: reading the header from a Word docx

Question

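I am trying to read the header of a Word document using python-docx and watchdog. Whenever a file is created or modified, the script reads the file and gets the content of the header, but I get this error:

docx.opc.exceptions.PackageNotFoundError: Package not found at 'Test6.docx'

I have tried everything, including opening the file as a stream, but nothing worked. And yes, the document is populated. For reference, this is my code.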

**main.py**
    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler
    import watchdog.observers
    import watchdog.events
    import os
    import re
    import xml.dom.minidom
    import zipfile
    from docx import Document


    class Watcher:
        DIRECTORY_TO_WATCH = "/path/to/my/directory"

        def __init__(self):
            self.observer = Observer()

        def run(self):
            event_handler = Handler()
            self.observer.schedule(event_handler,path='C:/Users/abdsak11/OneDrive - Lärande', recursive=True)
            self.observer.start()
            try:
                while True:
                    time.sleep(5)
            except:
                self.observer.stop()
                print ("Error")

            self.observer.join()


    class Handler(FileSystemEventHandler):

        @staticmethod
        def on_any_event(event):
            if event.is_directory:
                return None

            elif event.event_type == 'created':
                # Take any action here when a file is first created.
                path = event.src_path
                extenstion = '.docx'
                base = os.path.basename(path)

                if extenstion in path:
                    print ("Received created event - %s." % event.src_path)
                    time.sleep(10)
                    print(base)
                    doc = Document(base)
                    print(doc)
                    section = doc.sections[0]
                    header = section.header
                    print (header)



            elif event.event_type == 'modified':
                # Taken any action here when a file is modified.
                path = event.src_path
                extenstion = '.docx'
                base = os.path.basename(path)
                if extenstion in base:
                    print ("Received modified event - %s." % event.src_path)
                    time.sleep(10)
                    print(base)
                    doc = Document(base)
                    print(doc)
                    section = doc.sections[0]
                    header = section.header
                    print (header)



    if __name__ == '__main__':
        w = Watcher()
        w.run()

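Edit: I tried changing the extension from doc to docx, and that worked, but is there any way to open the docx files as they are, since those are what I am getting?

Another thing: when I open the ".doc" file and try to read the header, all I get is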

<docx.document.Document object at 0x03195488>
<docx.section._Header object at 0x0319C088>

What I want to do is extract the text from the header.

Answer 1

You are printing the objects themselves, but you should access their attributes instead:

...
doc = Document(base)
section = doc.sections[0]
header = section.header
print(header.paragraphs[0].text)

(According to https://python-docx.readthedocs.io/en/latest/user/hdrftr.html)
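
A header can hold several paragraphs, and a document can have several sections, so a slightly fuller sketch collects the text of all of them. The helper names `header_text` and `read_header_text` are made up for illustration; only `Document`, `sections`, `header`, `paragraphs` and `text` come from python-docx.

```python
def header_text(document):
    """Return the header text of every section, one list entry per section."""
    texts = []
    for section in document.sections:
        # Keep only paragraphs that contain something other than whitespace.
        lines = [p.text for p in section.header.paragraphs if p.text.strip()]
        texts.append("\n".join(lines))
    return texts


def read_header_text(path):
    """Open the .docx at `path` and return its header text per section."""
    # python-docx is imported lazily so header_text has no hard dependency on it.
    from docx import Document
    return header_text(Document(path))
```

In the handler this could be called as `print(read_header_text(event.src_path))`. Text placed inside a table in the header would not show up here, since this sketch only walks `paragraphs`.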

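Update

While working with the python-docx package, I found that PackageNotFoundError is very generic: it can be raised simply because the file is inaccessible for some reason (it does not exist, was not found, or is blocked by permissions), and also when the file is empty or corrupted. With watchdog, for example, the file may well be renamed, deleted, and so on after the "created" event fires and before the Document is created. And for some reason you increase the likelihood of this by waiting 10 seconds before creating the Document? So try checking whether the file exists first: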

if not os.path.exists(base):
    raise OSError('{}: file does not exist!'.format(base))
doc = Document(base)
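
Instead of a fixed sleep, one option is to poll until the file actually looks like a finished package. The sketch below assumes that python-docx rejects anything that is not a readable zip archive (a .docx is a zip container), so it uses the standard library's `zipfile.is_zipfile` as the readiness test. `wait_for_docx` and its `timeout` and `interval` parameters are hypothetical names, not part of python-docx or watchdog.

```python
import os
import time
import zipfile


def wait_for_docx(path, timeout=10.0, interval=0.5):
    """Poll until `path` exists and is a readable zip archive, or time out.

    Returns True if the file became ready, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        # is_zipfile returns False for missing or empty files and for files
        # that do not yet contain a complete zip directory.
        if os.path.isfile(path) and zipfile.is_zipfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

It is probably worth passing the full `event.src_path` here and to `Document` rather than the bare file name: a bare name like 'Test6.docx' is resolved against the current working directory, which is usually not the watched folder.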

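Update 2

Also note that this can happen when the program that opens the file creates a lock file based on the file name. For example, running your code on Linux and opening the file with LibreOffice results in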

PackageNotFoundError: Package not found at '.~lock.xxx.docx#'

because this file is not a docx file! So you should update your filter condition to

if path.endswith(extenstion):
...
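
Checking the suffix alone may still let temporary files through: Word itself writes owner files such as `~$Test6.docx` next to an open document, and those do end in `.docx`. A slightly stricter filter, sketched below, looks at the base name as well. `is_docx_candidate` and `TEMP_PREFIXES` are hypothetical names, and the prefix list only covers the LibreOffice and Word lock files discussed here.

```python
import os

# Prefixes of lock/owner files created by LibreOffice (.~lock.) and Word (~$).
TEMP_PREFIXES = (".~lock.", "~$")


def is_docx_candidate(path):
    """Return True if `path` names a regular .docx file rather than a lock file."""
    name = os.path.basename(path)
    if name.startswith(TEMP_PREFIXES):
        return False
    # Compare case-insensitively so 'REPORT.DOCX' is accepted too.
    return name.lower().endswith(".docx")
```

In the handler, `if is_docx_candidate(event.src_path):` would then replace the `extenstion in path` test.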

Tags: python, ms-word, python-docx, python-watchdog