Iterate through Table of Contents in docx using python-docx

Question

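I have a document with a table of contents that is auto-generated at the beginning, and I would like to parse that table of contents. Is this possible using python-docx? If I iterate over doc.paragraphs and read each paragraph's text, the text in the table of contents does not show up.

I tried the following: iterate through the paragraphs and check whether paragraph.style.name is "toc 1", which tells me I am inside the ToC. But I cannot get the actual text. I tried this: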

if para.style.name == "toc 1":
    print(para.text)

But para.text gives me an empty string. Why is that?

Thanks

Answer 1

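I believe you'll find that the actual generated TOC content is "wrapped" in a non-paragraph element. python-docx won't take you there directly, because it only finds paragraphs that are direct children of the w:document/w:body element.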
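The difference between direct-child paragraphs and wrapped ones can be seen on a small, hand-written XML fragment. This is only a sketch using the standard library's ElementTree rather than python-docx, and it assumes the wrapper is a content control (w:sdt with a w:sdtContent child), which is common for Word-generated TOCs but is not confirmed by the answer. The sample text and the paragraph_text helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written, heavily simplified body (an assumption, not real Word output):
# one TOC entry wrapped in w:sdt, followed by an ordinary body paragraph.
SAMPLE_BODY = f"""
<w:body xmlns:w="{W}">
  <w:sdt>
    <w:sdtContent>
      <w:p><w:r><w:t>1 Introduction</w:t></w:r></w:p>
    </w:sdtContent>
  </w:sdt>
  <w:p><w:r><w:t>Body text</w:t></w:r></w:p>
</w:body>
""".strip()


def paragraph_text(p):
    """Join the text of every w:t descendant of a w:p element."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


body = ET.fromstring(SAMPLE_BODY)

# Direct children of the body only: comparable to what doc.paragraphs sees.
direct = [paragraph_text(p) for p in body.findall("w:p", NS)]

# Paragraphs at any depth, including the one wrapped in w:sdt.
nested = [paragraph_text(p) for p in body.findall(".//w:p", NS)]

print(direct)
print(nested)
```

Only the second query reaches the wrapped entry, which is why a search limited to the body's direct children can miss TOC text.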

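To get at those paragraphs you'll need to go down to the lxml level, using python-docx to get you as close as possible. You can get to (and print) the body element like this: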

from docx import Document

document = Document('my-doc.docx')
body_element = document._body._body
print(body_element.xml)  # this will be big if your document is

From there you can work out the exact XML location of the parts you want and use lxml/XPath to access them. You can then wrap them in python-docx Paragraph objects for easier access:

from docx.text.paragraph import Paragraph

# 'w:something' and 'w:something_child' are placeholders for the real wrapper elements
ps = body_element.xpath('./w:something/w:something_child/w:p')
paragraphs = [Paragraph(p, None) for p in ps]

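This isn't an exact recipe, and you'll need to do some research to figure out what w:something and so on actually are, but if you want it badly enough to work through those hurdles, this approach will work.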
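One way to do that research is to list the element paths under which w:p elements actually occur. The sketch below uses only the standard library and takes the body XML as a string. It assumes the string's root is w:body and declares the w namespace, which should hold for the output of body_element.xml. The names short and wrapper_paths are made up for this example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def short(tag):
    """Turn '{namespace}p' into 'w:p' for WordprocessingML tags."""
    return tag.replace("{" + W + "}", "w:")


def wrapper_paths(body_xml):
    """Return the sorted ancestor paths (relative to the body) under
    which w:p elements appear, e.g. './w:sdt/w:sdtContent'."""
    body = ET.fromstring(body_xml)
    found = set()

    def walk(element, path):
        for child in element:
            if short(child.tag) == "w:p":
                # Record where this paragraph lives; do not descend into it.
                found.add(path)
            else:
                walk(child, path + "/" + short(child.tag))

    walk(body, ".")
    return sorted(found)
```

A result of "." means the paragraphs are direct children of the body; any longer path, with "/w:p" appended, is a candidate for the XPath expression. Elements from other namespaces keep their '{namespace}tag' form in the output and would need their own prefix before use in XPath.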

Once you get it working, posting your exact solution might help others who find this through search.

Answer 2

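Most of the solution is hidden in the comment section, and it took me a while to figure out what exactly the OP did and how scanny's answer changed it, so I'll post my solution here. It is just what is written in the comments on scanny's answer. I don't fully understand how the code works, so if anyone wants to edit my answer, please feel free to do so.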

import docx

# open the docx file with python-docx (raw string keeps the backslashes literal)
document = docx.Document(r"path\to\file.docx")
# get the underlying <w:body> element
body_elements = document._body._body
# collect every <w:r> (run) element at any depth
rs = body_elements.xpath('.//w:r')
# keep the text of runs whose character style is "Hyperlink" (the TOC entries)
table_of_content = [r.text for r in rs if r.style == "Hyperlink"]

table_of_content will be a list whose items are first the numbering and then the title.
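If the list really does alternate strictly between numbering and title, the flat list can be grouped into pairs. This is a minimal sketch under that assumption: a real document may add other hyperlink-styled runs, such as page numbers, that break the pattern. The helper name pair_entries and the sample data are made up for illustration.

```python
def pair_entries(table_of_content):
    """Group a flat [number, title, number, title, ...] list into pairs."""
    numbers = table_of_content[0::2]  # items at even positions
    titles = table_of_content[1::2]   # items at odd positions
    return list(zip(numbers, titles))


# Hypothetical data in the shape the answer describes.
flat = ["1", "Introduction", "1.1", "Background", "2", "Methods"]
entries = pair_entries(flat)

for number, title in entries:
    print(number, title)
```

Because zip stops at the shorter input, a trailing unpaired item is silently dropped, so it is worth checking the list length first if the alternation is in doubt.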
