Remove all metadata from a PowerPoint presentation using python-pptx

I can remove/overwrite some of the metadata (the metadata stored in core.xml) with the following code:

from datetime import datetime

import pptx


def remove_metadata(prs):
    """Overwrite the metadata stored in core.xml. Metadata stored in app.xml is not touched."""
    prs.core_properties.title = 'PowerPoint Presentation'
    prs.core_properties.last_modified_by = 'python-pptx'
    prs.core_properties.revision = 1
    prs.core_properties.modified = datetime.utcnow()
    prs.core_properties.subject = ''
    prs.core_properties.author = 'python-pptx'
    prs.core_properties.keywords = ''
    prs.core_properties.comments = ''
    prs.core_properties.created = datetime.utcnow()
    prs.core_properties.category = ''

prs = pptx.Presentation('my_pres.pptx')
remove_metadata(prs)

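This works well, but app.xml stores additional metadata, such as the company and manager. I need to clear those properties too. How can I edit the app.xml file using python-pptx?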
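
For reference, a .pptx file is a ZIP archive, and the two property parts normally live at docProps/core.xml and docProps/app.xml. A minimal sketch for checking what app.xml currently contains, using only the standard library. The helper name `read_app_xml` and the file name in the usage comment are placeholders, not part of python-pptx:

```python
import zipfile


def read_app_xml(pptx_path):
    """Return the text of docProps/app.xml, or None if the part is absent."""
    with zipfile.ZipFile(pptx_path) as archive:
        if "docProps/app.xml" not in archive.namelist():
            return None
        return archive.read("docProps/app.xml").decode("utf-8")


# Example (hypothetical file name):
# print(read_app_xml("my_pres.pptx"))
```

Running this before and after cleaning a file is a quick way to see whether elements such as Company or Manager are still present.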

I found a workaround. It is not necessarily the ideal way to handle this, but it seems to work:

import re


def remove_metadata_from_app_xml(prs):
    """There is currently no functionality for handling app.xml, so
    find the part and alter its blob manually.
    """
    app_xml_part = None
    for part in prs.part.package.parts:
        if part.partname.endswith('app.xml'):
            app_xml_part = part
            break
    if app_xml_part is None:
        return  # this package has no app.xml part
    app_xml = app_xml_part.blob.decode('utf-8')
    tags_to_remove = ('Company', 'Manager', 'HyperlinkBase')
    for tag in tags_to_remove:
        # Non-greedy match so that only this one element is removed
        pattern = f'<{tag}>.*?</{tag}>'
        app_xml = re.sub(pattern, '', app_xml)
    # Assumption: Part.blob has no setter, so write to the private
    # _blob attribute that backs it
    app_xml_part._blob = app_xml.encode('utf-8')
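
The regex approach assumes each element has no attributes and sits on a single line. As an alternative, here is a hedged sketch that parses the XML with the standard library's ElementTree instead. The function name `strip_app_properties` is made up for illustration, and the namespace URI is the extended-properties namespace that app.xml normally declares. ElementTree re-serializes the document, so the XML declaration and any namespace prefixes may come out slightly different from the original.

```python
import xml.etree.ElementTree as ET

# Default namespace normally declared by docProps/app.xml.
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def strip_app_properties(app_xml_bytes, tags=("Company", "Manager", "HyperlinkBase")):
    """Return app.xml bytes with the given top-level property elements removed."""
    # Keep the extended-properties namespace prefix-free on output.
    ET.register_namespace("", EP_NS)
    root = ET.fromstring(app_xml_bytes)
    for tag in tags:
        # findall returns a list, so removing while looping is safe.
        for element in root.findall(f"{{{EP_NS}}}{tag}"):
            root.remove(element)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The function takes the bytes from `app_xml_part.blob`, and the returned bytes can be stored back on the part the same way the function above does.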