Python docx - Modify runs to target specific words

I'm developing code in Python that searches a docx file for certain terms, for example finding the word "car" and highlighting it with a defined color.

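I'm using the docx module to identify and highlight the text, and I can apply changes at the run level (run.font.highlight_color). However, because MS Word stores the text in an XML file that tracks all changes, the word I'm looking for can be split across different runs or be part of a longer sentence. Since my end goal is to target one or more defined words, I'm struggling to achieve the expected result.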
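
To make the problem concrete, here is a minimal sketch that models runs as plain strings. This is an assumption for illustration only: real runs are XML elements that also carry formatting, and the run texts below are hypothetical. Searching each run separately misses a word that is split across two runs, while searching the joined paragraph text finds it:

```python
import re

# Hypothetical run texts, as Word might split "The car is red."
run_texts = ["The c", "ar", " is red."]
pattern = re.compile(r"\bcar\b")

# Run-level search: no single run contains the whole word.
per_run_hits = [i for i, text in enumerate(run_texts) if pattern.search(text)]

# Paragraph-level search: join the runs first, then match.
paragraph_text = "".join(run_texts)
match = pattern.search(paragraph_text)
span = (match.start(), match.end())

print(per_run_hits)  # no run matches on its own
print(span)          # offsets of "car" in the joined text
```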

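My main idea is to run a function that "cleans" the runs or the XML file so that each target word is isolated in its own run, which can then be highlighted. I haven't found any documentation on this, though, and I'm worried about losing font properties, styles, and so on.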

This is the code I have so far:

import docx
from docx.enum.text import WD_COLOR_INDEX
import re

doc = docx.Document('demo.docx')

words = {'car': 'RED',
         'bus': 'GREEN',
         'train station': 'BLUE'}

for word, color in words.items():
    w = re.compile(fr'\b{word}\b')
    
    for par in doc.paragraphs:
        for run in par.runs:
            s = re.findall(w, run.text)
            if s:
                run.font.highlight_color = getattr(WD_COLOR_INDEX, color)

doc.save('new.docx')

Has anyone run into the same problem, or does anyone have ideas for a different approach?

Thanks

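This function can be used to isolate a run within a paragraph, based on the match.start() and match.end() values you get from a regex match on paragraph.text. From there you can change the properties of the returned run as needed without affecting adjacent text: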

import copy
import itertools

from docx.text.run import Run


def isolate_run(paragraph, start, end):
    """Return docx.text.Run object containing only `paragraph.text[start:end]`.

    Runs are split as required to produce a new run at the `start` that ends at `end`.
    Runs are unchanged if the indicated range of text already occupies its own run. The
    resulting run object is returned.

    `start` and `end` are as in Python slice notation. For example, the first three
    characters of the paragraph have (start, end) of (0, 3). `end` is not the index of
    the last character. These correspond to `match.start()` and `match.end()` of a regex
    match object and `s[start:end]` of Python slice notation.
    """
    rs = tuple(paragraph._p.r_lst)

    def advance_to_run_containing_start(start, end):
        """Return (r_idx, start, end) triple indicating start run and adjusted offsets.

        The start run is the run the `start` offset occurs in. The returned `start` and
        `end` values are adjusted to be relative to the start of `r_idx`.
        """
        # --- add 0 at end so `r_ends[-1] == 0` ---
        r_ends = tuple(itertools.accumulate(len(r.text) for r in rs)) + (0,)
        r_idx = 0
        while start >= r_ends[r_idx]:
            r_idx += 1
        skipped_rs_offset = r_ends[r_idx - 1]
        return rs[r_idx], r_idx, start - skipped_rs_offset, end - skipped_rs_offset

    def split_off_prefix(r, start, end):
        """Return adjusted `end` after splitting prefix off into separate run.

        Does nothing if `r` is already the start of the isolated run.
        """
        if start > 0:
            prefix_r = copy.deepcopy(r)
            r.addprevious(prefix_r)
            r.text = r.text[start:]
            prefix_r.text = prefix_r.text[:start]
        return end - start

    def split_off_suffix(r, end):
        """Split `r` at `end` such that suffix is in separate following run."""
        suffix_r = copy.deepcopy(r)
        r.addnext(suffix_r)
        r.text = r.text[:end]
        suffix_r.text = suffix_r.text[end:]

    def lengthen_run(r, r_idx, end):
        """Add prefixes of following runs to `r` until `end` is reached."""
        while len(r.text) < end:
            suffix_len_reqd = end - len(r.text)
            r_idx += 1
            next_r = rs[r_idx]
            if len(next_r.text) <= suffix_len_reqd:
                # --- subsume next run ---
                r.text = r.text + next_r.text
                next_r.getparent().remove(next_r)
                continue
            if len(next_r.text) > suffix_len_reqd:
                # --- take prefix from next run ---
                r.text = r.text + next_r.text[:suffix_len_reqd]
                next_r.text = next_r.text[suffix_len_reqd:]

    r, r_idx, start, end = advance_to_run_containing_start(start, end)
    end = split_off_prefix(r, start, end)

    # --- if run is longer than isolation-range we need to split-off a suffix run ---
    if len(r.text) > end:
        split_off_suffix(r, end)
    # --- if run is shorter than isolation-range we need to lengthen it by taking text
    # --- from subsequent runs
    elif len(r.text) < end:
        lengthen_run(r, r_idx, end)

    return Run(r, paragraph)
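
To see the effect of the splitting without opening a Word file, the same idea can be modeled on a plain list of strings standing in for runs. This is a simplified sketch, not the function above: `isolate_span` is a hypothetical helper, it ignores formatting entirely, and it drops empty runs. In the real function the isolated run is the run in which the match begins (text from later runs is appended to it), so it keeps that run's formatting.

```python
def isolate_span(run_texts, start, end):
    """Return (new_run_texts, index) where text[start:end] is its own item.

    `run_texts` is a list of strings standing in for the runs of a paragraph.
    `start` and `end` are slice offsets into the joined text.
    """
    text = "".join(run_texts)
    if not 0 <= start < end <= len(text):
        raise ValueError("span must be non-empty and lie inside the text")

    result = []
    isolated_index = None
    offset = 0
    for run_text in run_texts:
        run_start, run_end = offset, offset + len(run_text)
        offset = run_end

        # Portion of this run that lies before the span, if any.
        before = text[run_start:min(run_end, start)] if run_start < start else ""
        # Portion of this run that lies after the span, if any.
        after = text[max(run_start, end):run_end] if run_end > end else ""

        if before:
            result.append(before)
        # The span is emitted once, at the run in which it begins.
        if run_start <= start < run_end:
            isolated_index = len(result)
            result.append(text[start:end])
        if after:
            result.append(after)

    return result, isolated_index


new_runs, idx = isolate_span(["The c", "ar", " is red."], 4, 7)
print(new_runs, idx)
```

The joined text is the same before and after the split; only the run boundaries move.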

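This is trickier than you might expect; it was certainly trickier than I expected when I started working on it. Anyway, it's the kind of thing that comes in handy from time to time.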
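
For the `words` dictionary in the question, one way to collect the `start` and `end` values is to build a single regex from all target words and scan the paragraph text once. The sketch below is an assumption about how you might wire this up, not part of python-docx: `find_targets` is a hypothetical helper, and the sample sentence is made up.

```python
import re

# Target words and highlight color names, as in the question.
words = {"car": "RED", "bus": "GREEN", "train station": "BLUE"}

# One alternation, longest target first so a multi-word target wins over
# any shorter target it might contain. re.escape guards against regex
# metacharacters appearing in a target.
alternatives = sorted(words, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")


def find_targets(text):
    """Return a list of (start, end, color_name) for each target found in `text`."""
    return [(m.start(), m.end(), words[m.group()]) for m in pattern.finditer(text)]


spans = find_targets("Take the bus or a car to the train station.")
print(spans)
```

With a real document, `text` would be `paragraph.text`, each `(start, end)` pair would be passed to `isolate_run(paragraph, start, end)`, and the color name would be resolved with `getattr(WD_COLOR_INDEX, color_name)` as in the question. Since isolating a run should leave the paragraph text itself unchanged, the offsets computed up front should stay valid across calls.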