How do I extract specific lines from a string, starting at one keyword and ending at a different keyword, in Python?

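The goal of my code is to take the text from a Word document and, for each occurrence of a keyword, extract the text from that keyword up to the associated part number. For example: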

The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component,

would become:

detecting, by a component in a transport, that another component has been removed 244C

On top of that, I need to take that text and place it in an image that I create in code. Here is my code:

import re
import time
import textwrap
from docx import Document
from PIL import Image, ImageFont, ImageDraw

doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)

for i, p in enumerate(docText):
    W, H = 300, 300
    body = Image.new('RGB', (W, H), (255, 255, 255))
    border = Image.new('RGB', (W + 2, H + 2), (0, 0, 0))
    border.save('border.png')
    body.save('body.png')
    patent = Image.open('border.png')
    patent.paste(body, (1, 1))
    draw = ImageDraw.Draw(patent)
    font = ImageFont.load_default()

    current_h, pad = 60, 20
    keywords = ['responsive', 'detecting', 'providing', 'Responsive', 'Detecting', 'Providing']
    pattern = re.compile('|'.join(keywords))
    parts = re.findall("\d{1,3}[C]", docText)
    print(parts)
    for keywords in textwrap.wrap(docText, width=50):
        line = keywords.encode('utf-8')
        w, h = draw.textsize(line, font=font)
        draw.text(((W-w)/2, current_h), line, (0, 0, 0), font=font)
        current_h += h + pad

    patent.save(f'patent_{i+1}_{time.strftime("%Y%m%d%H%M%S")}.png')

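What my code currently does is print the entire text of the Word document as a string and output an image of the entire text 500+ times, which matches the number of characters in the string. Here is an example of one of my outputs: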

This output is repeated 500+ times. In addition, the following is printed in the run window:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C. ['244C', '246C', '248C', '249C']

The array after the paragraph is also repeated 500+ times.
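The repetition is most likely explained by the outer loop: `enumerate(docText)` iterates over a string one character at a time, so the loop body, including the `print(parts)` call and the image save, runs once per character. A minimal sketch of that behavior, using a short stand-in string (`sample`, an assumption for illustration) instead of the real document text:

```python
import re

# Stand-in for docText; the real string comes from the Word document.
sample = "detecting a fault 244C, providing data 248C"

# Iterating over a str yields single characters, so a loop over
# enumerate(sample) runs len(sample) times, not once per sentence.
iterations = sum(1 for _ in enumerate(sample))
print(iterations == len(sample))

# Finding the part numbers once, outside any loop, avoids the repeated output.
parts = re.findall(r"\d{1,3}C", sample)
print(parts)
```

Looping over the extracted phrases, rather than over the raw string, would produce one image per phrase.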

This is the Word document that I am reading and converting into a single string:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

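I would like to know how to extract specific lines from the string I have built. The output should look like this (ignore the box and the centering); I only want to output these lines from the paragraph I gave: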

Some pseudocode would be something like:

for keyword in docText:
     print({keyword, part number})

My current implementation uses docx, PIL, and re, though I am happy to use anything that achieves my goal. Any help is appreciated!

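So, with help from outside resources, I managed to solve everything. Leaving aside the code that outputs to an image with centered text and so on, here is the code that solves my main problem: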

import re

from docx import Document
from PIL import Image, ImageFont, ImageDraw

doc = Document('PatentDocument.docx')
docText = ''.join(paragraph.text for paragraph in doc.paragraphs)
print(docText)


def get(source, begin, end):
    # Return the text between the first `begin` and the next `end`,
    # or "" if either marker is missing.
    try:
        start = source.index(begin) + len(begin)
        finish = source.index(end, start)
        return source[start:finish]
    except ValueError:
        return ""


def create_regex(keywords=('responsive', 'providing', 'detecting')):
    # For the default keywords this builds the equivalent of:
    # ([Rr]esponsive|[Pp]roviding|[Dd]etecting).*?(\d{1,3}C)
    regex = (
        "("
        + "|".join(f"[{k[0].upper()}{k[0].lower()}]{k[1:]}" for k in keywords)
        + ")"
        + r".*?(\d{1,3}C)"
    )
    return re.compile(regex)


def find_matches(text, keywords):
    return [m.group() for m in re.finditer(create_regex(keywords), text)]


for match in find_matches(
    text=docText, keywords=("responsive", "detecting", "providing")
):
    print(match)

So, from the source document:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

I get the following output:

[0054] The processor 204 performs one or more of detecting, by a component in a transport, that another component has been removed 244C, detecting, by the component, that a replacement component has been added in the transport 246C, providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C, and responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C.

detecting, by a component in a transport, that another component has been removed 244C

detecting, by the component, that a replacement component has been added in the transport 246C

providing, by the component, data to the replacement component, wherein the data attempts to subvert an authorized functionality of the replacement component 248C

responsive to a non-subversion of the authorized functionality, permitting, by the component, use of the authorized functionality of the replacement component 249C

There are no blank lines between the strings printed after the full text, but I separated them here for readability. Hope this helps someone else!
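The pseudocode in the question asks for each keyword paired with its part number. A minimal sketch of one way to get that, assuming part numbers are always one to three digits followed by `C` as in the sample paragraph. `extract_pairs` is a hypothetical helper, not part of the code above, and it uses `re.IGNORECASE` in place of the per-letter character classes:

```python
import re


def extract_pairs(text, keywords=("responsive", "detecting", "providing")):
    """Return (phrase, part_number) tuples, one per keyword-to-part-number span."""
    # Escape the keywords and join them into one alternation; \b keeps a
    # keyword from matching inside a longer word.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        rf"\b((?:{alternation})\b.*?)\s(\d{{1,3}}C)\b",
        re.IGNORECASE | re.DOTALL,
    )
    # The lazy .*? stops at the first part number after each keyword.
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]


# Shortened stand-in for the document text.
sample = (
    "The unit performs detecting, by a component, a removal 244C, "
    "and responsive to that, permitting use 249C."
)
pairs = extract_pairs(sample)
for phrase, part in pairs:
    print(part, "->", phrase)
```

Keeping the phrase and the part number in separate groups makes it straightforward to draw them on separate lines of an image, or to key a dictionary by part number.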