Python-docx: Find and replace all placeholder numbers in Word doc with random numbers

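I'm having trouble finding and replacing every occurrence of several placeholders in the paragraphs of a Word file. It's a gamebook, so I'm trying to assign random entry numbers to the placeholders I used while drafting the book.

All placeholders start with "#" (e.g. #1-5, #22-1, etc.). Pre-set numbers, such as the first entry (which is always "1"), have no "#" prefix. For reference, each placeholder entry is paired with its random counterpart in a tuple, built by zipping the two lists.

This all works fine for the headings, because that is a straight one-for-one paragraph swap in sequence. The problem comes when I iterate over the regular paragraphs (the second-to-last bit of the code). It seems to replace only the first eight numbers and then stops. I've tried setting up a loop, but that doesn't seem to help. I'm not sure what I'm missing. Code is below.

EDIT: Here is how the two lists and the reference tuple are set up. In this test only the first entry is pre-set, and no paragraph refers back to it. All the others are to be randomized and replaced in-paragraph.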

entryWorking: ['#1-1', '#1-2', '#1-3', '#1-4', '#1-5', '#1-6', '#1-7', '#1-8', '#2', '#2-1', '#2-2', '#2-3', '#2-4', '#2-5', '#2-6', '#2-7', '#16', '#17', '#3', '#3-1', '#3-2', '#3-3', '#3-4', '#3-5', '#3-6', '#3-8', '#3-9']

entryNumbers: ['2', '20', '12', '27', '23', '4', '11', '16', '26', '7', '25', '5', '3', '15', '17', '6', '18', '22', '10', '21', '19', '13', '28', '8', '14', '9', '24']

reference: (('#1-1', '2'), ('#1-2', '20'), ('#1-3', '12'), ('#1-4', '27'), ('#1-5', '23'), ('#1-6', '4'), ('#1-7', '11'), ('#1-8', '16'), ('#2', '26'), ('#2-1', '7'), ('#2-2', '25'), ('#2-3', '5'), ('#2-4', '3'), ('#2-5', '15'), ('#2-6', '17'), ('#2-7', '6'), ('#16', '18'), ('#17', '22'), ('#3', '10'), ('#3-1', '21'), ('#3-2', '19'), ('#3-3', '13'), ('#3-4', '28'), ('#3-5', '8'), ('#3-6', '14'), ('#3-8', '9'), ('#3-9', '24'))

Thanks for any help.

import sys, os, random
from docx import *

entryWorking = [] # The placeholder entries created for the draft gamebook


# Identify all paragraphs with a specific heading style (e.g. 'Heading 2')
def iter_headings( paragraphs, heading ) :
    for paragraph in paragraphs :
        if paragraph.style.name.startswith( heading ) :
            yield paragraph


# Open the .docx file
document = Document( 'TestFile.docx' )


# Search document for unique placeholder entries (must have a unique heading style)
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    entryWorking.append( heading.text )


# Create list of randomized gamebook entry numbers
entryNumbers = [ i for i in range( len ( entryWorking ) + 1 ) ]

# Remove unnecessary entry zero (extra added above to compensate)
entryNumbers.remove( 0 )

# Convert to strings
entryNumbers = [ str( x ) for x in entryNumbers ]


# Identify pre-set entries (such as Entry 1), and remove from both lists
# This avoids pre-set numbers being replaced (i.e. they remain as is in the .docx)
# Pre-set entry numbers must _not_ have the "#" prefix in the .docx
for string in entryWorking :
    if string[ 0 ] != '#' :
        entryWorking.remove( string )
        if string in entryNumbers :
            entryNumbers.remove( string )

# Shuffle new entry numbers
random.shuffle( entryNumbers )


# Create tuple list of placeholder entries paired with random entry
reference = tuple( zip( entryWorking, entryNumbers ) )


# Replace placeholder headings with assigned randomized entry
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    for entry in reference :
        if heading.text == entry[ 0 ] :
            heading.text = entry[ 1 ]


# Search through paragraphs for placeholders and replace with randomized entry
for paragraph in document.paragraphs :
    for run in paragraph.runs :
        for entry in reference :
            if run.text == entry[ 0 ] :
                run.text = entry [ 1 ]

                        
# Save the new document with final entries
document.save('Output.docx')

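In Word, runs break at arbitrary locations in the text. A placeholder such as #2-1 can therefore be split across two or more runs, in which case no single run.text equals it and a run-by-run comparison never matches.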
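A minimal sketch of why the run-by-run comparison can fail. It uses a plain list of strings as a stand-in for the text of a paragraph's runs (the sample text and the name `run_texts` are made up for illustration), on the understanding that `paragraph.text` is the run texts joined together:

```python
# Hypothetical run texts for one paragraph; Word has split "#2-1" across two runs.
run_texts = ["Turn to #2", "-1 if you open the door."]
placeholder = "#2-1"

# Stand-in for paragraph.text: the run texts joined in order.
paragraph_text = "".join(run_texts)

# The placeholder is visible at paragraph level ...
in_paragraph = placeholder in paragraph_text

# ... but no individual run is equal to it, so `run.text == entry[0]` never fires.
matching_runs = [text for text in run_texts if text == placeholder]

print(in_paragraph, matching_runs)
```

Note that even an unsplit placeholder would only match this way if it happened to sit in a run of its own, since the comparison is for equality rather than containment.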

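You might be interested in the links in this answer, which demonstrate the (surprisingly complex) work required to do this sort of thing in the general case: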

How to use python-docx to replace text in a Word document and save

There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.

This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurrence of "foobar" in the text or perhaps making it bold or appear in a larger font.

Fortunately, it's usually copy-pastable with good results :)
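To make the cross-run replacement concrete, here is a rough sketch of the same idea applied to a plain list of run texts rather than real python-docx runs. The function name `replace_in_run_texts` and the sample data are hypothetical, and only the first match is handled:

```python
import re


def replace_in_run_texts(run_texts, regex, replace_str):
    """Return a copy of `run_texts` with the first match of `regex` replaced.

    The replacement goes into the run where the match starts; any part of
    the match that spills into later runs is trimmed from those runs.
    """
    texts = list(run_texts)
    match = regex.search("".join(texts))
    if not match:
        return texts

    start, end = match.start(), match.end()

    # Skip leading runs that end before the match starts.
    i = 0
    while i < len(texts) - 1 and start >= len(texts[i]):
        start -= len(texts[i])
        end -= len(texts[i])
        i += 1

    # Put the whole replacement into the run where the match begins.
    run_len = len(texts[i])
    texts[i] = texts[i][:start] + replace_str + texts[i][end:]
    end -= run_len
    i += 1

    # Trim whatever remains of the match from the following runs.
    while end > 0 and i < len(texts):
        run_len = len(texts[i])
        texts[i] = texts[i][end:]
        end -= run_len
        i += 1

    return texts


runs_before = ["Turn to #2", "-1", " now"]
runs_after = replace_in_run_texts(runs_before, re.compile(re.escape("#2-1")), "7")
print(runs_after)
```

Because the replacement lands in the run where the match starts, it picks up that run's formatting, which is the behaviour described above for the first character of the matched string.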

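Thanks to scanny for the help!

The last problem I found after getting it working was that I had to add a "#" suffix to each reference number to make them unique (for example, so that the random entry for #2 was not substituted into #2-1).

Working code below.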
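As a possible alternative to editing the placeholders in the document, the prefix clash can be sidestepped in code by trying longer placeholders before shorter ones. The sketch below works on a plain string with a small made-up sample of the mapping; `replace_placeholders` is a hypothetical helper, not part of python-docx:

```python
import re

# Hypothetical sample of the placeholder-to-number pairs.
reference = (("#2", "26"), ("#2-1", "7"), ("#16", "18"))
mapping = dict(reference)

# Longest placeholders first, so "#2-1" is tried before its prefix "#2".
alternatives = sorted(mapping, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(p) for p in alternatives))


def replace_placeholders(text):
    """Replace every known placeholder in `text` in a single pass."""
    return pattern.sub(lambda m: mapping[m.group(0)], text)


result = replace_placeholders("Go to #2 or #2-1, then #16.")
print(result)
```

The same ordering idea should carry over to a paragraph-level loop: iterating over the pairs sorted longest-first means #2-1 is already gone by the time #2 is searched for.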

import random, re
from docx import Document



# Identify all paragraphs with a specific heading style (e.g. 'Heading 2')
def iter_headings( paragraphs, heading ) :
    for paragraph in paragraphs :
        if paragraph.style.name.startswith( heading ) :
            yield paragraph



def paragraph_replace_text( paragraph, regex, replace_str ) : # Credit to scanny on GitHub
    """Return `paragraph` after replacing all matches for `regex` with `replace_str`.

    `regex` is a compiled regular expression prepared with `re.compile(pattern)`
    according to the Python library documentation for the `re` module.
    """
    
    # --- a paragraph may contain more than one match, loop until all are replaced ---
    while True :
        text = paragraph.text
        
        match = regex.search( text )

        if not match :
            break


        # --- when there's a match, we need to modify run.text for each run that
        # --- contains any part of the match-string.
        runs = iter( paragraph.runs )
        start, end = match.start(), match.end()


        # --- Skip over any leading runs that do not contain the match ---
        for run in runs :
            run_len = len( run.text )

            if start < run_len :
                break

            start, end = start - run_len, end - run_len


        # --- Match starts somewhere in the current run. Replace match-str prefix
        # --- occurring in this run with entire replacement str.
        run_text = run.text

        run_len = len( run_text )

        run.text = "%s%s%s" % ( run_text[ :start ], replace_str, run_text[ end: ] )

        end -= run_len  # --- note this is run-len before replacement ---

        # --- Remove any suffix of match word that occurs in following runs. Note that
        # --- such a suffix will always begin at the first character of the run. Also
        # --- note a suffix can span one or more entire following runs.
        for run in runs :  # --- next and remaining runs, uses same iterator ---
            if end <= 0 :
                break

            run_text = run.text

            run_len = len( run_text )

            run.text = run_text[ end: ]

            end -= run_len

    # --- optionally get rid of any "spanned" runs that are now empty. This
    # --- could potentially delete things like inline pictures, so use your judgement.
    # for run in paragraph.runs :
    #     if run.text == "" :
    #         r = run._r
    #         r.getparent().remove( r )

    return paragraph


""" NOTE: Replace 'Doc.docx' with your filename """
# Open the .docx file
document = Document( 'Doc.docx' )


# Search document for unique placeholder entries (must have a unique heading style)
entryWorking = [] # The placeholder entries created for the draft gamebook


""" NOTE: Replace 'Heading 2' with your entry number header """
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    entryWorking.append( heading.text )


# Create list of randomized gamebook entry numbers
entryNumbers = [ i for i in range( len ( entryWorking ) + 1 ) ]


# Remove unnecessary entry zero (extra added above to compensate)
entryNumbers.remove( 0 )


# Convert to strings
entryNumbers = [ str( x ) for x in entryNumbers ]


# Identify pre-set entries (such as Entry 1), and remove from both lists
# This avoids pre-set numbers being replaced (i.e. they remain as is in the .docx)
# Pre-set entry numbers must _not_ have the "#" prefix in the .docx
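# Build new lists instead of removing items while iterating, which can skip elements
presetEntries = [ string for string in entryWorking if not string.startswith( '#' ) ]
entryWorking = [ string for string in entryWorking if string.startswith( '#' ) ]
entryNumbers = [ number for number in entryNumbers if number not in presetEntries ]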


# Shuffle new entry numbers
random.shuffle( entryNumbers )


# Create tuple list of placeholder entries paired with random entry
reference = tuple( zip( entryWorking, entryNumbers ) )


# Replace placeholder headings with assigned randomized entry
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    for entry in reference :
        if heading.text == entry[ 0 ] :
            heading.text = entry[ 1 ]


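# Search through paragraphs for placeholders and replace with randomized entry
for paragraph in document.paragraphs :
    for entry in reference :
        if entry[ 0 ] in paragraph.text :
            # Escape the placeholder so it is always matched as literal text
            regex = re.compile( re.escape( entry[ 0 ] ) )
            paragraph_replace_text( paragraph, regex, entry[ 1 ] )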

                        
# Save the new document with final entries
document.save('Output.docx')