使用 python-docx 在文本文件中突出显示或加粗字符串?

Highlight or bold strings in a text file using python-docx?

我有一个'short strings'的列表,比如:

['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']

我需要匹配包含在 word 文件 (BSA.docx) 或 .txt 文件(无关紧要)中的 'long string',例如:

sp|P02769|ALBU_BOVIN Albumin OS=Bos taurus OX=9913 GN=ALB PE=1 SV=4 MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCVADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPKLKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLVTDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEKDAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEATLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKVPQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKHKPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA

我想使用 python(在终端或 jupyter 笔记本中)获得以下内容:

  1. 在长字符串中突出显示较短的字符串匹配项。高亮样式不重要,可以用黄色标记高亮或者加粗,或者下划线,跳到眼睛里看有没有匹配的都可以。

  2. 求长字符串的覆盖率为((突出显示的字符数)/(长字符串的总长度))*100。请注意,长字符串的第一行以“>>”开头的只是一个标识符,需要忽略。

这是第一个任务的当前代码:

from docx import Document

doc = Document('BSA.docx')

peptide_list = ['MKWVTFISLLLLFSSAYSRGV', 'SSAYSRGVFRRDTHKSEIAH', 'KPKATEEQLKTVMENFVAFVDKCCA']

def highlight_peptides(text, keywords):
    text = text.paragraphs[1].text
    replacement = "3[91m" + "\1" + "3[39m"
    enter code here`text = re.sub("(" + "|".join(map(re.escape, keywords)) + ")", replacement, text, flags=re.I)
    

highlight_peptides(doc, peptide_list)

问题是列表中的前两个短字符串重叠,在结果中只有第一个在序列中以红色突出显示。

请参阅下面的第一个 link,其中包含我正在获取的输出结果。

current result

查看第二个 link 以可视化我的 'ideal' 结果。

ideal result

在理想中,我还包括了第二个任务,即查找序列覆盖率。我不确定如何计算彩色或突出显示的字符。

您可以使用 third-party regex module 进行重叠关键字搜索。然后,分两次完成匹配可能是最简单的:(1) 存储每个突出显示的段的开始和结束位置并组合任何重叠的部分:

import regex as re # important - not using the usual re module

def find_keywords(keywords, text):
    """ Return a list of positions where keywords start or end within the text. 
    Where keywords overlap, combine them. """
    pattern = "(" + "|".join(re.escape(word) for word in keywords) + ")"
    r = []
    for match in re.finditer(pattern, text, flags=re.I, overlapped=True):
        start, end = match.span()
        if not r or start > r[-1]: 
            r += [start, end]  # add new segment
        elif end > r[-1]:
            r[-1] = end        # combine with previous segment
    return r

positions = find_keywords(keywords, text)

您的 'keyword coverage'(突出显示的百分比)可以计算为:

coverage = sum(positions[1::2]) - sum(positions[::2]) # sum of end positions - sum of start positions
percent_coverage = coverage * 100 / len(text)

然后 (2) 为文本添加格式,使用 run properties in docx:

import docx

def highlight_sections_docx(positions, text):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    document = docx.Document()
    p = document.add_paragraph()
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        run = p.add_run(text[start:end])
        if i % 2:  # odd segments are highlighted
            run.bold = True   # or add other formatting - see https://python-docx.readthedocs.io/en/latest/api/text.html#run-objects
    return document

doc = highlight_sections_docx(positions, text)
doc.save("my_word_doc.docx")

或者,您可以突出显示 html 中的文本,然后使用 htmldocx 程序包将其保存到 Word 文档:

def highlight_sections(positions, text, start_highlight="<mark>", end_highlight="</mark>"):
    """ Add characters to a text to highlight the segments indicated by
     a list of alternating start and end positions """
    r = ""
    for i, (start, end) in enumerate(zip([None] + positions, positions + [None])):
        if i % 2:  # odd segments are highlighted
            r += start_highlight + text[start:end] + end_highlight
        else:      # even segments are not
            r += text[start:end]
    return r

from htmldocx import HtmlToDocx
s = highlight_sections(positions, text, start_highlight="<strong>", end_highlight="</strong>")
html = f"""<html><head></head><body><span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span></body></html>"""
HtmlToDocx().parse_html_string(html).save("my_word_doc.docx")

(<mark><strong> 更适合使用 html 标签,但不幸的是,HtmlToDocx 不保留 <mark> 的任何格式,并忽略 CSS 样式)。

highlight_sections也可以用来输出到控制台:

print(highlight_sections(positions, text, start_highlight="3[91m", end_highlight="3[39m"))

... 或到 Jupyter / IPython 笔记本:

from IPython.core.display import HTML
s = highlight_sections(positions, text)
display(HTML(f"""<span style="width:100%; word-wrap:break-word; display:inline-block;">{s}</span>""")