How can I extract index marker data from a docx document using python-docx?

Question

Given a simple paragraph, block, I want to extract the index marker data from it.

Simple code like this:

print(block.text)

for run in block.runs:
    print(run)

prints the paragraph text and the list of associated runs, one of which (as far as I can tell) contains a special XE (index entry) field.

This is a test.
<docx.text.run.Run object at 0x7f800f369c50>
<docx.text.run.Run object at 0x7f800f369da0>
<docx.text.run.Run object at 0x7f800f369dd8>
<docx.text.run.Run object at 0x7f800f369c18>
<docx.text.run.Run object at 0x7f800f369e48>
<docx.text.run.Run object at 0x7f800f369eb8>
<docx.text.run.Run object at 0x7f800f369f28>

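I need to extract the data from the run that contains the index marker, along with that run's position in the paragraph (i.e. the n-th character).

Is there an API in the python-docx library that I have missed and that would help here? Or should I parse the raw XML? How can I get the raw XML of the paragraph?

Thanks!!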

Answer 1

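For this you can drop down to the lxml/oxml layer.

You will need some sort of "outer" loop to keep track of the current offset. A generator function can be handy for this.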

def iter_xe_runs_with_offsets(paragraph):
    """Generate (run, run_idx, text_offset) triples from `paragraph`."""
    text_offset = 0
    for run_idx, run in enumerate(paragraph.runs):
        if contains_index_marker(run):
            yield (run, run_idx, text_offset)
        text_offset += len(run.text)

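Then a processing method can use it to do whatever is needed:

def process_paragraph(paragraph):
    for run, run_idx, text_offset in iter_xe_runs_with_offsets(paragraph):
        # ... do the needful, e.g. report where each marker sits ...
        print(run_idx, text_offset)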

And you will need a helper to tell whether a run has an index marker. This will use lxml.etree._Element methods on the run._r run element object.

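def contains_index_marker(run):
    """Return True if `run` is marked as index entry."""
    r = run._r
    # Assumption: the index entry is stored as a field code in a
    # `w:instrText` child element whose text starts with "XE". Inspect
    # the paragraph XML to confirm how your document stores it, since
    # the exact test depends on whether it is an attribute or child element.
    instr_texts = r.xpath("./w:instrText")
    return any((t.text or "").strip().startswith("XE") for t in instr_texts)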
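
If you would rather parse the raw XML directly, as the question suggests, the same offset-tracking idea can be sketched with only the standard library. This is a minimal illustration, not python-docx's own API: SAMPLE_P is a made-up paragraph, find_xe_entries is a hypothetical helper name, and it assumes the index entry lives in a w:instrText element whose field code starts with "XE". It also counts only w:t text toward the offset, so tabs and line breaks are ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph XML, shaped the way an XE field is assumed to be stored.
SAMPLE_P = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">This is </w:t></w:r>'
    '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText xml:space="preserve"> XE "test" </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    '<w:r><w:t>a test.</w:t></w:r>'
    '</w:p>'
)


def find_xe_entries(p_xml):
    """Return (run_idx, text_offset, field_code) for each XE run in `p_xml`."""
    p = ET.fromstring(p_xml)
    entries = []
    text_offset = 0  # number of visible characters seen before the current run
    for run_idx, r in enumerate(p.findall("w:r", NS)):
        for instr in r.findall("w:instrText", NS):
            code = (instr.text or "").strip()
            if code.startswith("XE"):
                entries.append((run_idx, text_offset, code))
        # Only w:t text is counted as visible paragraph text here.
        text_offset += sum(len(t.text or "") for t in r.findall("w:t", NS))
    return entries


for run_idx, text_offset, code in find_xe_entries(SAMPLE_P):
    print(run_idx, text_offset, code)
```

With python-docx, the XML string for a real paragraph should be available as paragraph._p.xml, which you could print to check how your document stores its markers before relying on the assumption above.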
