Regex to compile all text between tabs/indents within paragraphs in Python

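Apologies in advance if this has been answered elsewhere; I have been searching, but I am not good with regular expressions. I am using a regex to compile sentences from a Word document that contains paragraphs. I specifically need to get the text between two indents, or, if someone can help me fix my current regex (shown later), that would work as well. For example, from the following text:

This is the plain text of the image, but I couldn't reproduce the same formatting: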

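  1. A method, comprising:

    storing a first data related to an operation style of a transport in a first area;

    storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and

    modifying functionality of the transport based on the combined energy consumption efficiency.

  2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.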

And here is the text that is actually read in from my function:

  1. A method, comprising:            storing a first data related to an operation style of a transport in a first area;             storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and             modifying functionality of the transport based on the combined energy consumption efficiency.2.     The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.

When I print the text read in from the .docx file, all of it is output on a single line.

I need to extract the following lines:

storing a first data related to an operation style of a transport in a first area;

storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and

modifying functionality of the transport based on the combined energy consumption efficiency.

My current regex pattern is as follows:

pattern = re.compile(r"[ \t]+([^\s.;]+\s*)+[.;]+")

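As mentioned, it would be great if someone could help me fix this regex so that it reads up to a semicolon or period. I know part of my problem is that I have [ \t] rather than just [\t], but when I remove the space I get no output at all. Also, the current regex was meant to read up to a semicolon, but I have since decided to read up to the next indent instead, so that I can parse the sentences afterwards and strip the unnecessary information. In case it helps, my current output looks like this:

Here is just the raw text of the output: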

A method, comprising:            storing a first data related to an operation style of a transport in a first storing a second data related to an operation style of the transport in a second first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second and             modifying functionality of the transport based on the combined energy consumption method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular

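Each line of text in the image is a single output from my code. Any text that you don't recognize from the original .docx excerpt is just more text from the .docx file.

Finally, here is the code that I am currently using: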

def find_matches(text):
    print(text)
    pattern = re.compile(r"[ \t]+([^\s.;]+\s*)+[.;]+")
    return capitalize([m.group() for m in re.finditer(pattern, text)])


for match in find_matches(text=docText):
    ct += 1
    match_words = match.split(" ")
    match = " ".join(match_words[:-1])
    print(match)

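So, I just need a regex that reads from indent to indent. Again, apologies if this has been asked elsewhere; I simply couldn't find it.

I added this bit because I finally got some output with the regex pattern, but it all appears to be gibberish, I assume because of the encoding. Here is the code I have to show: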

doc = open('P.docx', mode='r', encoding = "ISO-8859-1")
docText = doc.read()
pattern = r"^[^.;]*\s{2,}([^\s.;]*(?:\s+[^\s.;]+)+[.;])"
print(re.findall(pattern, docText, re.MULTILINE))

This is only part of the output I get with it (since there is a lot):

'½ú\x04Ü\x13\x8eÕ\nõ+;', '\x7fîÙ(\x11\x90\x85íÆ\x83Bs\x15Ü\xa0g\x03i\x00a\x070§¬gÃo\x18Ë\x9a\x81i[¡\x8eÃ{\x96FÃ9\x9f\x8aãð6°AÏ>ö·\x98+\x80e·!f\x8d\x0e{\x12W\x1eéÝ}iûͨ½niü>Ú¶mB¥»\tÜÀªÓÿº$í}b^3¢¡7\t\x1amwR\x19ò\x96\x83"Hf\x0fòÑ«NÀ=áXÝP½²£ç\x1a\x01ZÁÍEÃÌ4ÒÄ\x90-dÌìáy½Þ|yFÕ,4ýÂÍ.', "ð`\x9c\n\x99´-Á:bÒÒY²O\x86\x88\x06'\x93°Îx4û§'?Ì÷\xad\x00m{N¸r6a\x86×8Û\x9drâúÙÄ9\x85\x91\x0c-;",
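A note on the gibberish above: a .docx file is a ZIP archive of XML parts, so opening it in text mode with ISO-8859-1 decodes compressed bytes rather than the document text, and no regex will find sentences in that. The following is a minimal sketch using only the standard library, assuming the body text lives in word/document.xml as it does in ordinary Word files. The helper name read_docx_paragraphs and the tiny in-memory document are hypothetical and exist only so the sketch runs without P.docx:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        chunks = []
        # Only look inside runs, so tab-stop definitions are not mistaken for tabs.
        for run in para.iter(W_NS + "r"):
            for node in run:
                if node.tag == W_NS + "t":
                    chunks.append(node.text or "")
                elif node.tag == W_NS + "tab":
                    chunks.append("\t")
        paragraphs.append("".join(chunks))
    return paragraphs


# Hypothetical stand-in for P.docx: a two-paragraph document built in memory.
document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>1. A method, comprising:</w:t></w:r></w:p>"
    "<w:p><w:r><w:tab/><w:t>storing a first data;</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

paragraphs = read_docx_paragraphs(buffer)
print(paragraphs)
```

Because each paragraph comes back as its own string, the indented clauses no longer run together on one line. Passing a path such as 'P.docx' instead of the buffer should work the same way; the third-party python-docx package exposes similar paragraph text through Document(path).paragraphs.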

You can start the match with 1 or more spaces or tabs, and capture what you want in a group.

^[ \t]+([^\s.;]+(?:\s+[^\s.;]+)*[.;])
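  • ^ Start of string (or of each line when re.MULTILINE is set, as in the example below)
  • [ \t]+ Match 1+ tabs or spaces
  • ( Capture group 1
    • [^\s.;]+ Match 1+ non-whitespace chars other than . and ;
    • (?:\s+[^\s.;]+)* Optionally repeat matching 1+ whitespace chars followed by 1+ non-whitespace chars other than . and ;
    • [.;] Match either . or ;
  • ) Close group 1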

Example

import re
from pprint import pprint
pattern = r"^[ \t]{2,}([^\s.;]+(?:\s+[^\s.;]+)+[.;])"

s = ("1. A method, comprising:\n\n"
     "      storing a first data related to an operation style of a transport in a first area;\n\n"
     "     storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and \n\n"
     "     modifying functionality of the transport based on the combined energy consumption efficiency.\n\n"
     "2. The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws. \n\n"
     "And here is the text that is actually read in from my function:\n\n"
     "> 1. A method, comprising:            storing a first data related to an operation style of a transport in a first area;             storing a second data related to an operation style of the transport in a second area; wherein the first and second data is based on a combined energy consumption efficiency as the transport maneuvers through the first and second area; and             modifying functionality of the transport based on the combined energy consumption efficiency.2.     The method of claim 1, comprising modifying functionality of the transport to operate at a greatest combined efficiency consumption efficiency while in compliance with one or more of social necessities and vehicular laws.\n")

result = re.findall(pattern, s, re.MULTILINE)
pprint(result, width=100)

Output

['storing a first data related to an operation style of a transport in a first area;',
 'storing a second data related to an operation style of the transport in a second area;',
 'modifying functionality of the transport based on the combined energy consumption efficiency.']
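
The example above relies on each clause starting on its own line. The text printed in the question arrives as a single line, so a ^-anchored pattern cannot see the indents in the middle of it. One possible workaround, assuming the runs of two or more spaces or tabs are what remains of the indents, is to split on those runs instead. This is only a sketch: the helper name split_on_indents and the shortened sample are hypothetical.

```python
import re


def split_on_indents(text):
    """Split flattened text on runs of 2+ spaces/tabs (the former indents)."""
    pieces = re.split(r"[ \t]{2,}", text.strip())
    cleaned = []
    for piece in pieces:
        # Drop a claim number glued to the end, e.g. "transport.2." -> "transport."
        piece = re.sub(r"(?<=[.;])\d+\.$", "", piece)
        if piece:
            cleaned.append(piece)
    return cleaned


# Hypothetical, shortened version of the single-line text from the question.
flat = (
    "1. A method, comprising:            storing a first data in a first area;"
    "             storing a second data in a second area; wherein both are used; and"
    "             modifying functionality of the transport.2.     The method of claim 1."
)

parts = split_on_indents(flat)
for part in parts:
    print(part)
```

With this sample, parts[1:4] holds the three indented clauses, including the "; and" tail that a pattern stopping at the first semicolon would cut off.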