Highlighting certain words in a Word document with Python

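I want to highlight every occurrence of a given set of words in a given Word document. However, I can only highlight the first word in a sentence...

Example: suppose a Word document contains the text below, and I want to highlight the words Approximate, Pending and May. It works fine for the first four lines, but in the fifth line only "Pending" gets highlighted.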

Approximate 
Pending spending 
May
May
Pending Approximate xx sit May 

Here is my code. Can you help me fix this?

from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import pandas as pd
import os
import re


path = r"C:\Users\files"  # a raw string cannot end with a single backslash
input_file = r"C:\Users\files\Dictionary.xlsx"


# List of words
df = pd.read_excel(input_file) 
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)


for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = os.path.join(path, filename)
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall(r'\b' + phrase + r'\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
          
        doc.save( file )

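I think I found the problem.

Here you insert plain text into the paragraph, without any formatting, which overwrites any formatting applied earlier to the pre part: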

pre = para.text[:start]
...
para.text = pre

The same goes for the post part:

post = para.text[start + len(phrase):]
...
para.add_run(post)

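Each iteration overwrites any highlighting done in the previous iteration with plain text. The text property of the paragraph only gives you a string without formatting. From the documentation of the text property of the paragraph object: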

Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed.

You can see this if you change the order of the elements in my_list: only the last element ends up highlighted. I started with

my_list = ['approximate', 'may', 'pending']

which resulted in Pending being highlighted. Then I switched to

my_list = ['approximate', 'pending', 'may']

which resulted in May being highlighted. I am referring to the last line of your example here.
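
To make the effect easier to see without opening Word, here is a rough toy model of that loop. It is only a sketch: it does not use python-docx at all, the function name highlight_like_question is made up, and a paragraph is modelled as a plain list of (text, highlighted) pairs. I also pass the capitalized words directly instead of building them from the Excel list.

```python
def highlight_like_question(text, words):
    """Toy model of the question's loop; runs are (text, highlighted) pairs."""
    runs = [(text, False)]
    for word in words:
        # Like reading para.text: the runs are joined and all formatting is dropped.
        plain = "".join(t for t, _ in runs)
        start = plain.find(word)
        if start == -1:
            continue
        # Like `para.text = pre` followed by the two add_run calls:
        # the paragraph is rebuilt, and only the current word is highlighted.
        runs = [
            (plain[:start], False),
            (word, True),
            (plain[start + len(word):], False),
        ]
    return runs


line = "Pending Approximate xx sit May"
for order in (["Approximate", "May", "Pending"], ["Approximate", "Pending", "May"]):
    highlighted = [t for t, h in highlight_like_question(line, order) if h]
    print(order, "->", highlighted)
```

In this model only the word processed last keeps its highlight, which matches what I saw in the real document.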

Edit: here is an attempt to fix it.

I replaced

# List of words
df = pd.read_excel(input_file) 
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)

with

# List of words
df = pd.read_excel(input_file) 
my_list = df['dictionary'].tolist()

# Setup regex
patterns = [r'\b' + re.escape(word) + r'\b' for word in my_list]
re_highlight = re.compile('(' + '|'.join(patterns) + ')+',
                          re.IGNORECASE)

and

for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = os.path.join(path, filename)
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall(r'\b' + phrase + r'\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
          
        doc.save( file )

with

for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = os.path.join(path, filename)
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            text = para.text
            matches = list(re_highlight.finditer(text))
            if matches:
                # Rebuild the paragraph once: plain runs between the matches,
                # highlighted runs for the matches themselves.
                para.text = ''
                p3 = 0
                for match in matches:
                    p1 = p3
                    p2, p3 = match.span()
                    para.add_run(text[p1:p2])
                    run = para.add_run(text[p2:p3])
                    run.font.highlight_color = WD_COLOR_INDEX.YELLOW
                # Whatever follows the last match stays unhighlighted.
                para.add_run(text[p3:])
        doc.save(file)

It works for the example you provided. But I am no regex wiz, so there may be a more concise solution.
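
If you want to check the matching logic on its own, the splitting step can be pulled out into a small pure-Python function. This is just a sketch: split_for_highlight is a name I made up, it is not part of python-docx, and it assumes every entry in the word list is a string. Each returned segment would correspond to one add_run call, with the highlight applied only where the flag is True.

```python
import re


def split_for_highlight(text, words):
    """Split text into (segment, is_match) pairs. Hypothetical helper."""
    if not words:
        return [(text, False)] if text else []
    # Whole-word, case-insensitive match on any of the words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    segments = []
    pos = 0
    for match in pattern.finditer(text):
        start, end = match.span()
        if start > pos:
            segments.append((text[pos:start], False))  # plain text before the match
        segments.append((text[start:end], True))       # the word to highlight
        pos = end
    if pos < len(text):
        segments.append((text[pos:], False))           # plain text after the last match
    return segments


words = ["approximate", "pending", "may"]
print(split_for_highlight("Pending spending", words))
print(split_for_highlight("Pending Approximate xx sit May", words))
```

Because of the word boundaries, "spending" in the second line of your example is left alone, and all three words in the fifth line come back flagged.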