How to get all the text in a nested table using Python?

Question

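I have to extract all the text in nested tables (tables inside a table inside a table) from a Word document. I haven't been able to do it with python-docx, possibly because I don't know enough about it.

Please recommend some code examples.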

Answer 1

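python-docx looks more like a library for writing/modifying docx files; you might want to try PyPDF2 (https://pythonhosted.org/PyPDF2/). I don't quite follow the table-inside-a-table part, though. I assume the tables are nested inside a Word document? If so, just read the document with PyPDF2 and put the words you want to keep into a table. Happy reading.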

Answer 2

You'll need some kind of recursion. The basic idea is:

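def iter_paragraphs_of_tables(tables):
    # Walk every cell of every table, recursing into tables nested in a cell.
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs
                yield from iter_paragraphs_of_tables(cell.tables)

# `document` is a python-docx Document that has already been loaded.
for paragraph in iter_paragraphs_of_tables(document.tables):
    print(paragraph.text)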
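
As a fuller sketch of the same idea, the variant below also reports how deeply each paragraph is nested and wraps the file loading in a helper. The names iter_table_paragraphs and nested_table_text, and the depth value, are illustrative assumptions and not part of python-docx; the helper assumes python-docx is installed.

```python
def iter_table_paragraphs(tables, depth=0):
    """Yield (depth, paragraph) pairs, recursing into tables nested in cells.

    Works with any objects exposing .rows, .cells, .paragraphs and .tables,
    which is the shape python-docx tables and cells have.
    """
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    yield depth, paragraph
                # Tables inside this cell are one level deeper.
                yield from iter_table_paragraphs(cell.tables, depth + 1)


def nested_table_text(path):
    """Return the text of every table paragraph in the .docx file at `path`."""
    # Imported lazily so the generator above has no hard dependency.
    from docx import Document

    document = Document(path)
    return [p.text for _, p in iter_table_paragraphs(document.tables)]
```

Calling nested_table_text("example.docx") (a hypothetical file name) would return a flat list of strings. Two caveats: each cell's own paragraphs are yielded before its nested tables, and merged cells may be reported more than once by row.cells, so repeated text is possible.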

This is Python 3; if you are using Python 2, you will need to expand the yield from statements into:

yield from cell.paragraphs
# --- becomes ---
for paragraph in cell.paragraphs:
    yield paragraph
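
Note that the recursive call needs the same expansion as the cell.paragraphs line. A minimal sketch of the whole generator written without yield from, so that it should run on both Python 2 and Python 3, might look like this:

```python
def iter_paragraphs_of_tables(tables):
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                # Expanded form of: yield from cell.paragraphs
                for paragraph in cell.paragraphs:
                    yield paragraph
                # Expanded form of the recursive yield from
                for paragraph in iter_paragraphs_of_tables(cell.tables):
                    yield paragraph
```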

python

python-docx