在保留样式 docx 库的同时替换段落中的字符串

Question

我正在替换word文档中表格和段落中的字符串。然而风格改变了。如何保持原来的样式格式？

with open(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges\VariableNames.json") as p:
                data = json.load(p)

document = Document(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges_update\F10352-JB117-FMXXX Pile XXXX As-built Memo GAIA Auto trial_v6.docx")

for key, value in data.items():
    for paragraph in document.paragraphs:
        if key in paragraph.text:
            paragraph.text = paragraph.text.replace(str(key), str(value))
for key, value in data.items():
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if key in paragraph.text:
                        paragraph.text = paragraph.text.replace(str(key),str(value))

有一个类似的post，但是它对我没有帮助（也许我做错了什么）。

Answer 1

在位于的 docx 库文档中 https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects，它说明了以下关于为 paragraph.text 赋值的内容：

“为此属性分配文本会导致所有现有段落内容被替换为包含分配文本的单个运行。... 段落级格式，例如作为样式，将保留。 删除所有运行级格式，例如粗体或斜体。 "

你观察到的风格变化是否与此一致？
如果是这样，那么您可能正在丢失具有特定样式的“运行”对象，它们是段落对象的子对象。在这种情况下，您最好在循环中添加另一个级别以遍历所有 paragraph.runs 并单独替换那些 .

上的文本

例如，一旦你有了段落，那么

for run in paragraph.runs:
    if key in run.text:
        run.text = run.text.replace(str(key), str(value))

Answer 2

@property
def text(self):
    """
    String formed by concatenating the text of each run in the paragraph.
    Tabs and line breaks in the XML are mapped to ``\t`` and ``\n``
    characters respectively.

    Assigning text to this property causes all existing paragraph content
    to be replaced with a single run containing the assigned text.
    A ``\t`` character in the text is mapped to a ``<w:tab/>`` element
    and each ``\n`` or ``\r`` character is mapped to a line break.
    Paragraph-level formatting, such as style, is preserved. All
    run-level formatting, such as bold or italic, is removed.
    """
    text = ''
    for run in self.runs:
        text += run.text
    return text

从 documentation 看来，样式应该保持不变； bold/italic 可以删除格式。

如果这是您要保留的格式，您可能需要先确定运行密钥所在的内容，然后再对其进行修改。

Answer 3

这应该可以满足您的需求。需要 docx2python 2.0.0+

from docx2python.utilities import replace_docx_text

replace_docx_text(
    input_filename,
    output_filename,
    ("Apples", "Bananas"),  # replace Apples with Bananas
    ("Pears", "Apples"),  # replace Pears with Apples
    ("Bananas", "Pears"),  # replace Bananas with Pears
    html=True,
)

如果替换字符串包含制表符或符号，您可能会遇到问题，但“常规”文本替换将起作用并保留 大多数[1] 格式。

为实现这一点，docx2python 将不会替换格式更改的文本字符串，例如“此字符串的一部分 为粗体”，除非您指定 html=False，在无论格式如何，都会替换 which case 字符串，并且会丢失一些格式。

[1] 将保留以下内容：

斜体
加粗
下划线
罢工
上标
下标
小型大写字母
全部大写
突出显示
字体大小
彩色文字
（还有一些，但不保证）

编辑后续问题，如何替换表格中的标记文本？

我这样做的工作流程是将所有格式保留在 Word 中。也就是说，我在 Word 中创建一个模板，切出我需要的上下文，然后像拼图一样将所有内容重新组合在一起。

这个 github“项目”是一个示例（一个文件），说明我如何替换表格中的文本（表格可以是任意大小）。

https://github.com/ShayHill/replace_docx_tables

在保留样式 docx 库的同时替换段落中的字符串

Replace string in paragraph while keeping style docx library

python

docx