Get docx element from doc.element.iter()

Question

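Question: how do I use the child object (below) to actually get a paragraph or table object?

This is based on an answer found here, which referenced docx Issue 40.

Unfortunately, none of the code posted there seems to work with commit e784a73, but I was able to get close by inspecting the code (and by trial and error).

I have the following...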

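import docx
import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
from docx.table import _Cell

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph.
    """
    print(type(parent))
    if isinstance(parent, docx.document.Document):
        # Use the argument rather than the global `doc`.
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    # Note: these are the low-level XML elements, not Paragraph/Table objects.
    for child in parent_elm.iter():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield ("(paragraph)", child)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            yield ("(table)", child)

# `doc` is the Document object opened earlier in the script.
for i in iter_block_items(doc):
    print(i)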

This successfully iterates over the elements and gives the following output...

doc= <class 'docx.document.Document'>
<class 'docx.document.Document'>
('(table)', <CT_Tbl '<w:tbl>' at 0x10c9ce0e8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9ceef8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9ce0e8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef98>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9ce0e8>)
('(table)', <CT_Tbl '<w:tbl>' at 0x10c9ceef8>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef48>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef48>)
('(paragraph)', <CT_P '<w:p>' at 0x10c9cef98>)

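At this point all I need is the text of each paragraph and, for each table, the table object, so that I can iterate over its cells.

But child.text (for a paragraph) does not return the paragraph text (as para.text does in the example below), because the child object is not actually a paragraph object but an element object, from which it should be possible to 'get' one.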

for para in doc.paragraphs:
    print(para.text)

Edit:

I tried:

yield child.text
(yields "None")

and

from docx.text import paragraph
yield paragraph(child)
(Errors with TypeError: 'module' object is not callable)

and

from docx.oxml.text import paragraph
yield paragraph(child)
(Errors with TypeError: 'module' object is not callable)

Answer 1

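You need to instantiate a proxy object for each element if you want the API properties and methods. That's where those properties and methods live.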

if isinstance(child, CT_P):
    yield Paragraph(child, parent)
elif isinstance(child, CT_Tbl):
    yield Table(child, parent)

This will generate Paragraph and Table objects. A Paragraph object has a .text property. For a table, you'll need to dig down into the cells.

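What your code is getting is the underlying XML element objects, which use a low-level lxml interface (actually elaborated with the so-called oxml and/or xmlchemy interface). That is a lower level than you probably want, unless you are extending the behavior of a proxy object like Paragraph.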
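
To illustrate why child.text came back as None at the element level, here is a small sketch using the standard library's xml.etree.ElementTree as a stand-in for lxml. The XML is a hand-written, simplified paragraph, not output from a real document. On a plain XML element, .text is only the character data before the first child, while a paragraph's visible text sits in its w:t descendants.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified, hand-written paragraph: two runs, each holding a text element.
xml = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>Hello, </w:t></w:r>"
    "<w:r><w:t>world</w:t></w:r>"
    "</w:p>"
)
p = ET.fromstring(xml)

# The element's own .text is whatever precedes its first child: nothing here.
own_text = p.text

# The visible text has to be collected from the w:t descendants.
paragraph_text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))

print(repr(own_text), repr(paragraph_text))
```

Collecting the run text for you is part of what the Paragraph proxy provides through its .text property.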
