Highlight or bold strings in a text file using python-docx?
I have a list of 'short strings', such as:
['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']
I need to match them against a 'long string' contained in a Word file (BSA.docx) or a .txt file (it doesn't matter which), for example:
sp|P02769|ALBU_BOVIN Albumin OS=Bos taurus OX=9913 GN=ALB PE=1 SV=4
MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
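Using Python (in a terminal or a Jupyter notebook), I would like to do the following:
Highlight the short-string matches within the long string. The highlight style is not important: yellow marker, bold, or underline, anything that jumps out so I can see at a glance whether there is a match.
Compute the coverage of the long string as ((number of highlighted characters) / (total length of the long string)) * 100. Note that the first line of the long string, which starts with ">>", is just an identifier and should be ignored.
Here is my current code for the first task: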
import re
from docx import Document

doc = Document('BSA.docx')
peptide_list = ['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']

def highlight_peptides(doc, keywords):
    # The sequence is in the second paragraph; the first holds the identifier line
    text = doc.paragraphs[1].text
    # ANSI escape codes: switch to bright red, then back to the default foreground
    replacement = "\033[91m" + r"\1" + "\033[39m"
    pattern = "(" + "|".join(map(re.escape, keywords)) + ")"
    return re.sub(pattern, replacement, text, flags=re.I)

print(highlight_peptides(doc, peptide_list))
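The problem is that the first two short strings in the list overlap, and in the result only the first one is highlighted in red in the sequence.
See the first link below for the output I am currently getting:
current result
See the second link for a visualization of my 'ideal' result:
ideal result
In the ideal result I also included the second task, finding the sequence coverage. I am not sure how to count the colored or highlighted characters.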
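You can use the third-party regex module to do an overlapping keyword search. It is then probably easiest to work in two passes: (1) store the start and end positions of each highlighted segment, combining any that overlap: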
import regex as re  # important - not using the usual re module

def find_keywords(keywords, text):
    """ Return a list of positions where keywords start or end within the text.
    Where keywords overlap, combine them. """
    pattern = "(" + "|".join(re.escape(word) for word in keywords) + ")"
    r = []
    for match in re.finditer(pattern, text, flags=re.I, overlapped=True):
        start, end = match.span()
        if not r or start > r[-1]:
            r += [start, end]  # add new segment
        elif end > r[-1]:
            r[-1] = end  # combine with previous segment
    return r

# keywords is the list of short strings, text is the long sequence string
positions = find_keywords(keywords, text)
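If installing the third-party module is not an option, a similar overlapping search can be sketched with the standard library re module by wrapping the alternation in a lookahead. This is a swapped-in technique, not the regex module's overlapped=True, and the name find_keywords_stdlib is hypothetical. It assumes that taking the longest keyword at each start position is enough, which holds once overlapping segments are merged:

```python
import re

def find_keywords_stdlib(keywords, text):
    """Return a flat list [start0, end0, start1, end1, ...] of merged matches."""
    # Longest keyword first, so the longest alternative wins at each start position
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    if not ordered:
        return []
    # A lookahead consumes no characters, so a match may start inside an earlier one
    pattern = re.compile(
        "(?=(" + "|".join(re.escape(k) for k in ordered) + "))", flags=re.I
    )
    r = []
    for match in pattern.finditer(text):
        start, end = match.span(1)  # span of the captured keyword, not the empty match
        if not r or start > r[-1]:
            r += [start, end]  # add new segment
        elif end > r[-1]:
            r[-1] = end  # combine with previous segment
    return r

print(find_keywords_stdlib(["ABCD", "CDEF", "XY"], "ABCDEFGHXYZ"))  # [0, 6, 8, 10]
```

Here "ABCD" (0 to 4) and "CDEF" (2 to 6) overlap and are merged into a single segment, while "XY" stays separate.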
Your 'keyword coverage' (the percentage highlighted) can then be calculated as:
coverage = sum(positions[1::2]) - sum(positions[::2]) # sum of end positions - sum of start positions
percent_coverage = coverage * 100 / len(text)
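As a small worked example of that arithmetic, the sketch below first drops the identifier line so it does not count toward the total length. The helper names strip_header and coverage_percent are hypothetical, and the sketch assumes the raw text is FASTA-like, with the identifier on a first line that begins with ">":

```python
def strip_header(raw):
    """Drop a leading identifier line (assumed to start with '>') and join the rest."""
    lines = raw.strip().splitlines()
    if lines and lines[0].startswith(">"):
        lines = lines[1:]
    return "".join(line.strip() for line in lines)

def coverage_percent(positions, sequence):
    """positions is a flat list of alternating start and end indices."""
    if not sequence:
        return 0.0
    covered = sum(positions[1::2]) - sum(positions[::2])  # ends minus starts
    return covered * 100 / len(sequence)

raw = ">example identifier line\nABCDE\nFGHIJ\n"
sequence = strip_header(raw)   # "ABCDEFGHIJ"
positions = [0, 5, 7, 9]       # segments 0-5 and 7-9, i.e. 5 + 2 = 7 characters
print(coverage_percent(positions, sequence))  # 70.0
```

Because the segments have already been merged, no character is counted twice, so the sum of end positions minus the sum of start positions is exactly the number of highlighted characters.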
Then (2) add formatting to the text, using run properties in docx:
import docx

def highlight_sections_docx(positions, text):
    """ Add characters to a text to highlight the segments indicated by
    a list of alternating start and end positions """
    document = docx.Document()
    p = document.add_paragraph()
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        run = p.add_run(text[start:end])
        if i % 2:  # odd segments are highlighted
            run.bold = True  # or add other formatting - see https://python-docx.readthedocs.io/en/latest/api/text.html#run-objects
    return document

doc = highlight_sections_docx(positions, text)
doc.save("my_word_doc.docx")
Alternatively, you can highlight the text in HTML and then use the htmldocx package to save it to a Word document:
def highlight_sections(positions, text, start_highlight="<mark>", end_highlight="</mark>"):
    """ Add characters to a text to highlight the segments indicated by
    a list of alternating start and end positions """
    r = ""
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        if i % 2:  # odd segments are highlighted
            r += start_highlight + text[start:end] + end_highlight
        else:  # even segments are not
            r += text[start:end]
    return r

from htmldocx import HtmlToDocx

s = highlight_sections(positions, text, start_highlight="<strong>", end_highlight="</strong>")
html = f"""<html><head></head><body><span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span></body></html>"""
HtmlToDocx().parse_html_string(html).save("my_word_doc.docx")
(<mark> would be a more appropriate HTML tag than <strong>, but unfortunately HtmlToDocx does not preserve any formatting for <mark>, and it ignores CSS styles.)
highlight_sections can also be used to output to the console:
# ANSI escape codes: switch to bright red, then back to the default foreground
print(highlight_sections(positions, text, start_highlight="\033[91m", end_highlight="\033[39m"))
... or to a Jupyter / IPython notebook:
from IPython.display import HTML, display
s = highlight_sections(positions, text)
display(HTML(f"""<span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span>"""))