Highlighting certain words in a Word document with Python
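I want to highlight every occurrence of a given set of words in a Word document, but I can only highlight the first word in a sentence...
Example: suppose the Word document contains the text below, and I want to highlight the words Approximate, Pending and May. It works fine for the first four lines, but in the fifth line only "Pending" gets highlighted.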
Approximate
Pending spending
May
May
Pending Approximate xx sit May
Here is my code. Can you help me fix this?
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import pandas as pd
import os
import re

path = r"C:\Users\files"
input_file = r"C:\Users\files\Dictionary.xlsx"

# List of words
df = pd.read_excel(input_file)
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)

for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = "C:\\Users\\files\\" + filename
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall(r'\b' + phrase + r'\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
        doc.save( file )
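I think I found the problem.
Here you insert plain text into the paragraph, without any formatting, which overwrites any formatting applied earlier in the pre part: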
pre = para.text[:start]
...
para.text = pre
The same goes for the post part:
post = para.text[start + len(phrase):]
...
para.add_run(post)
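This always overwrites any highlighting done in a previous iteration with plain text. The text property of a paragraph only gives you a string without formatting. From the documentation of the text property of the paragraph object: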
Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed.
You can see this if you change the order of the elements in my_list. Only the last element gets highlighted. I started with
my_list = ['approximate', 'may', 'pending']
which resulted in Pending being highlighted. Then I switched to
my_list = ['approximate', 'pending', 'may']
which resulted in May being highlighted. I am referring to the last line of your example here.
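The order dependence described above can be reproduced without Word at all. The following is only a toy model, not python-docx itself: FakeParagraph and highlight_like_question are hypothetical names, and the class merely imitates the documented behaviour that assigning to text replaces all runs with a single plain one.

```python
class FakeParagraph:
    """Toy stand-in for a paragraph: a list of (text, highlighted) runs."""

    def __init__(self, text):
        self.runs = [(text, False)]

    @property
    def text(self):
        # Reading text joins the runs and drops all formatting information.
        return "".join(t for t, _ in self.runs)

    @text.setter
    def text(self, value):
        # Assigning text replaces every run with one plain run.
        self.runs = [(value, False)]

    def add_run(self, text, highlighted=False):
        self.runs.append((text, highlighted))


def highlight_like_question(para, words):
    """Repeat the question's loop and return the texts still highlighted."""
    for phrase in words:
        start = para.text.find(phrase)
        if start > -1:
            full = para.text
            pre, post = full[:start], full[start + len(phrase):]
            para.text = pre                      # wipes earlier highlighting
            para.add_run(phrase, highlighted=True)
            para.add_run(post)                   # plain text again
    return [t for t, h in para.runs if h]


line = "Pending Approximate xx sit May"
print(highlight_like_question(FakeParagraph(line), ["Approximate", "May", "Pending"]))
print(highlight_like_question(FakeParagraph(line), ["Approximate", "Pending", "May"]))
```

In this model only the word processed last keeps its highlight, which matches the behaviour observed with the two orderings of my_list.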
Edit: here is an attempt to fix it.
I have replaced
# List of words
df = pd.read_excel(input_file)
my_list = df['dictionary'].tolist()
list_2 = [dic.capitalize() for dic in my_list]
list_3 = [dic.lower() for dic in my_list]
my_list.extend(list_2)
my_list.extend(list_3)
with
# List of words
df = pd.read_excel(input_file)
my_list = df['dictionary'].tolist()
# Setup regex
patterns = [r'\b' + word + r'\b' for word in my_list]
re_highlight = re.compile('(' + '|'.join(p for p in patterns) + ')+',
re.IGNORECASE)
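As a side note on the regex setup, here is a slightly more defensive variant. It is a sketch rather than part of the original fix: build_highlight_regex is a hypothetical helper name, re.escape is added on the assumption that dictionary entries might contain regex metacharacters, and a single non-capturing group replaces the repeated capturing group.

```python
import re


def build_highlight_regex(words):
    """Compile one case-insensitive pattern matching any word as a whole word."""
    # Escape each entry so characters such as '.' or '+' are taken literally.
    escaped = [re.escape(w) for w in words]
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b', re.IGNORECASE)


re_highlight = build_highlight_regex(['approximate', 'pending', 'may'])

# All three words are found in one pass, regardless of case.
print([m.group(0) for m in re_highlight.finditer('Pending Approximate xx sit May')])
# 'spending' is not matched because there is no word boundary before 'pending'.
print([m.group(0) for m in re_highlight.finditer('Pending spending')])
```

Because the pattern is case-insensitive, the capitalized and lower-case copies of the word list are no longer needed.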
and
for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = "C:\\Users\\files\\" + filename
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            for phrase in my_list:
                #start = para.text.find(phrase)
                x = para.text
                starts = re.findall(r'\b' + phrase + r'\b', x)
                #print(start)
                if len(starts)>0:
                    #print(starts)
                    #if start > -1 :
                    start = para.text.find(phrase)
                    pre = para.text[:start]
                    post = para.text[start+len(phrase):]
                    para.text = pre
                    para.add_run(phrase)
                    para.runs[1].font.highlight_color = WD_COLOR_INDEX.YELLOW
                    para.add_run(post)
        doc.save( file )
with
for filename in os.listdir(path):
    if filename.endswith(".docx"):
        file = "C:\\Users\\files\\" + filename
        print(file)
        doc = Document(file)
        for para in doc.paragraphs:
            text = para.text
            if len(re_highlight.findall(text)) > 0:
                matches = re_highlight.finditer(text)
                # Clear the paragraph once, then rebuild it run by run
                para.text = ''
                p3 = 0
                for match in matches:
                    p1 = p3
                    p2, p3 = match.span()
                    # Plain text between the previous match and this one
                    para.add_run(text[p1:p2])
                    # The matched word, highlighted
                    run = para.add_run(text[p2:p3])
                    run.font.highlight_color = WD_COLOR_INDEX.YELLOW
                # Plain text after the last match
                para.add_run(text[p3:])
        doc.save(file)
It works for the example you provided. But I'm no regex wiz, so there may be a more concise solution.
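One possible way to make the rebuilding step easier to check is to separate the splitting from the document handling. The sketch below is an illustration only: split_by_matches is a hypothetical helper, and the pattern is written out by hand for the three example words.

```python
import re


def split_by_matches(text, pattern):
    """Split text into (piece, is_match) tuples that cover it left to right."""
    pieces = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            # Plain text between the previous match and this one.
            pieces.append((text[pos:m.start()], False))
        pieces.append((m.group(0), True))
        pos = m.end()
    if pos < len(text):
        # Plain text after the last match.
        pieces.append((text[pos:], False))
    return pieces


pattern = re.compile(r'\b(?:approximate|pending|may)\b', re.IGNORECASE)
for piece, is_match in split_by_matches('Pending Approximate xx sit May', pattern):
    print(repr(piece), is_match)
```

Each tuple could then be passed to para.add_run, setting font.highlight_color only on the runs where the flag is true, so the paragraph is cleared and rebuilt exactly once.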