python-docx: iterate through paragraphs, tables and images while keeping order

Question

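This is my first post here. I want to write a script that takes a docx as input and copies selected paragraphs (including tables and images), in the same order, into another template document (not at the end). The problem I have is that when I iterate over the elements, my code does not detect the images, so I cannot tell where an image sits relative to the text and tables, nor which image it is. In short, I have doc1: TEXT IMAGE TEXT TABLE TEXT

What I end up with is: TEXT [IMAGE MISSING] TEXT TABLE TEXT

What I have so far:

- I can iterate through paragraphs and tables: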

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

I can get an ordered list of the document's images:

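from docx.enum.shape import WD_INLINE_SHAPE

# dwo is a Document object opened elsewhere in the script
pictures = []
for pic in dwo.inline_shapes:
    if pic.type == WD_INLINE_SHAPE.PICTURE:
        pictures.append(pic)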

I can insert a specific picture at the end of a paragraph:

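from io import BytesIO

from docx.shared import Inches


def insert_picture(index, paragraph):
    # Look up the relationship id (rId) of the picture's image part
    inline = pictures[index]._inline
    rId = inline.xpath('./a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed')[0]
    # Read the image bytes from the source document and append them to the paragraph
    image_part = dwo.part.related_parts[rId]
    image_bytes = image_part.blob
    image_stream = BytesIO(image_bytes)
    paragraph.add_run().add_picture(image_stream, Inches(6.5))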

I use the function iter_block_items() like this:

# insert_paragraph_after(), last_paragraph, paragraphs_with_table and
# tables_to_apppend are defined elsewhere in the script
start_copy = False
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        if block.text == "TEXT FROM WHERE WE STOP COPYING":
            break

    if start_copy:
        if isinstance(block, Paragraph):
            last_paragraph = insert_paragraph_after(last_paragraph, block.text)

        elif isinstance(block, Table):
            paragraphs_with_table.append(last_paragraph)
            tables_to_apppend.append(block._tbl)

    if isinstance(block, Paragraph):
        if block.text == "TEXT FROM WHERE WE START COPYING":
            start_copy = True

Answer 1

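I found a way to do it: it turns out that the images I wanted to keep in order were already inside the paragraphs as inline shapes. I used this: link to extract the images, and then inserted them using a modified version of

def insert_picture(index, paragraph):

in which I pass the rId instead of the index.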

Answer 2

You can find a working implementation of exactly this at the following link:

Extracting paras, tables and images in document order

Answer 3

There are (at least) two possibilities here: use xml (or lxml), or use a ready-made alternative Python module.

The alternative Python module (i.e. not python-docx) is docx2python. You use it like this:

from docx2python import docx2python

docx_obj = docx2python(path)  # path: location of the .docx file
body = docx_obj.body

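The structure in body really does contain the text and the tables in the correct order, which python-docx does not give you out of the box (a very serious flaw).

The docx2python project appears to still be alive, even though the author says on the above-linked page that he "won't be writing much code in 2022". As far as I can tell, it works fine. Be sure to read the notes on how tables and non-table text are turned into a structure.

At the bottom of the page there is some material well worth reading that explains why his version 2 is better than version 1. I haven't checked whether he has actually achieved this, but if so, it means it is in some respects superior to the alternative "pure lxml" solution below (e.g. consecutive runs and links).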
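To illustrate the structure those notes describe: as far as I understand the docx2python README, body is a list nested four levels deep (table, row, cell, paragraph), and text outside any table shows up as a table with a single cell. The sketch below walks such a structure. iter_body_paragraphs and sample_body are hypothetical, hand-written to mimic that layout, so check them against the real output for your own file.

```python
def iter_body_paragraphs(body):
    """Yield (table_index, row_index, cell_index, text) for each non-empty paragraph."""
    for t, table in enumerate(body):
        for r, row in enumerate(table):
            for c, cell in enumerate(row):
                for text in cell:
                    if text.strip():  # skip empty paragraphs
                        yield t, r, c, text


# Hand-written stand-in for docx_obj.body (an assumption, not real docx2python output):
# one block of plain text, then a 2 x 2 table, then more plain text.
sample_body = [
    [[["Intro text", ""]]],
    [[["A1"], ["B1"]], [["A2"], ["B2"]]],
    [[["Closing text"]]],
]

for position in iter_body_paragraphs(sample_body):
    print(position)
```

Because the outer list is already in document order, a single pass like this visits plain text and table cells in the order they appear in the file.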

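There is also a second way to take a Word document apart: a Word document is really just a .zip file with various components inside. For example, here is one way to count the paragraphs.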

import zipfile

from lxml import etree

WORD_SCHEMA_STRING = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

n_paras = 0  # paragraph counter
# file_path is the path to the .docx file
with open(file_path, 'rb') as f:
    zip_file = zipfile.ZipFile(f)
    xml_content_bytes = zip_file.read('word/document.xml')
    doc_content_xml_tree_source = etree.fromstring(xml_content_bytes)
    # Visit every element node in document order and count the w:p (paragraph) elements
    for i_node, node in enumerate(doc_content_xml_tree_source.iter(tag=etree.Element)):
        if node.tag == WORD_SCHEMA_STRING + 'p':
            n_paras += 1

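You basically need to do some exploring to see how "document.xml" is put together, and note that there are various other important documents in that zip file. But with the technique above you can expose all the xml nodes, which leaves you free to do whatever you need to do.

I'm not sure you actually need the external package lxml (i.e. rather than xml). I think I read somewhere that the speed of the latter has improved considerably. But I use lxml because I think it is probably still much faster than the standard library xml package.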
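For comparison, here is a rough sketch of the same zip-based approach using only the standard library (zipfile and xml.etree.ElementTree), which suggests lxml is not strictly required for this kind of walk. The function name iter_blocks is made up for illustration. It assumes that images are referenced through a:blip elements carrying an r:embed attribute and that word/_rels/document.xml.rels exists in the archive. Tables are only reported as markers; the sketch does not descend into them.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def iter_blocks(docx_file):
    """Yield ('paragraph', text, [image targets]) or ('table', None, []) in document order."""
    with zipfile.ZipFile(docx_file) as zf:
        document = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
    # Map each relationship id (rId) to the part it points at, e.g. media/image1.png
    targets = {rel.get("Id"): rel.get("Target") for rel in rels.iter(REL_NS + "Relationship")}
    body = document.find(W_NS + "body")
    for child in body:  # direct children only, so the original order is kept
        if child.tag == W_NS + "p":
            text = "".join(t.text or "" for t in child.iter(W_NS + "t"))
            images = [targets.get(blip.get(R_NS + "embed")) for blip in child.iter(A_NS + "blip")]
            yield "paragraph", text, images
        elif child.tag == W_NS + "tbl":
            yield "table", None, []
```

Each image target is a path relative to the word/ folder inside the archive, so the image bytes themselves can be read from the same zip file if they are needed.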
