Python Regex - Extract text between (multiple) expressions in a text file

Question

I am a Python beginner and would be very grateful if you could help me with a text-extraction problem.

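I want to extract all the text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze a bunch of files; below is an example of such a text file -> I want to extract all the text from "Dear" to "Douglas". If "letter_end" has no match, i.e. no letter_end expression is found, the output should start at the letter_begin match and run to the very end of the text file being analyzed.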

Edit: the end of "the recorded text" has to come after the match of "letter_end" and before the first line with 20 or more characters (as is the case for "Random text here as well" -> len=24).

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

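This is my code so far, but it is not able to flexibly catch the text between the expressions (there can be anything (lines, text, numbers, symbols, etc.) before "letter_begin" and after "letter_end"):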

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()  # read() already returns a str
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE) # record all text between Regex (Beginning and End Expressions)
    print(output)

Thank you very much for your help!

Answer 1

You may use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This pattern will produce a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

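See the regex demo. Note that you should not use re.DOTALL with this pattern: with that flag, .* also matches newlines and would run on to the end of the text. The re.MULTILINE option is redundant as well, since it only changes the behavior of ^ and $, which the pattern does not use.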
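To make the effect of re.DOTALL concrete, here is a small sketch. The short sample text and the simplified pattern (one opening, one closing) are made up for illustration and are not part of the question.

```python
import re

# Hypothetical miniature letter followed by three unrelated lines.
text = "Dear all\nbody line\nBest regards\nDouglas\n\nline A\nline B\nline C"

# Same shape as the pattern above, reduced to one opening and one closing.
pattern = r"dear[\s\S]*?best regards.*(?:\n.*){0,2}"

# Without DOTALL, ".*" stops at the end of each line.
without_dotall = re.findall(pattern, text, re.IGNORECASE)

# With DOTALL, the first ".*" after the closing swallows everything up to the end.
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall)
```

Without the flag, the match ends two lines after the closing; with it, the match runs to the end of the text.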

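Details

(?:dear|to our|estimated) - any of the three values
[\s\S]*? - any 0+ characters, as few as possible
(?:sincerely|yours|best regards) - any of the three values
.* - any 0+ characters other than a newline
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ characters other than a newline.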
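One thing to watch when building the alternation from lists: the expressions are inserted into the pattern as-is, so any regex metacharacter in them would be interpreted as pattern syntax. The sketch below passes each expression through re.escape first. The helper name build_alternation and the extra closing "p.s." are hypothetical and only serve the illustration.

```python
import re

def build_alternation(expressions):
    """Join literal expressions into a non-capturing group, escaping metacharacters."""
    return "(?:{})".format("|".join(re.escape(e) for e in expressions))

letter_begin = ["dear", "to our", "estimated"]
# "p.s." is a hypothetical addition; unescaped, its dots would match any character.
letter_end = ["sincerely", "yours", "best regards", "p.s."]

regex = (
    build_alternation(letter_begin)
    + r"[\s\S]*?"
    + build_alternation(letter_end)
    + r".*(?:\n.*){0,2}"
)
pattern = re.compile(regex, re.IGNORECASE)

print(pattern.findall("Dear X\nbody\nP.S. bye"))
```

With the lists from the question, escaping changes nothing about what is matched; it only matters once an expression contains characters such as ".", "(" or "?".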

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
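The pattern above always takes up to two lines after the closing and finds nothing when no closing exists. The question also asks for two other behaviors: stop before the first line of 20 or more characters, and fall back to the end of the text when no letter_end matches. The following is one possible sketch of both, not a tested solution for real filings. The function name extract_letters and the shortened sample text are assumptions; a negative lookahead replaces the fixed {0,2} count, and \Z serves as the fallback.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(map(re.escape, letter_begin))
closings = "|".join(map(re.escape, letter_end))

# After a closing: the rest of that line, then every following line that has
# fewer than 20 characters (the lookahead rejects lines with 20 or more).
# If no closing is found, \Z lets the lazy scan run to the end of the text.
regex = r"(?:{o})[\s\S]*?(?:(?:{c}).*(?:\n(?!.{{20}}).*)*|\Z)".format(
    o=openings, c=closings
)
pattern = re.compile(regex, re.IGNORECASE)

def extract_letters(text):
    """Return every letter found in text, with trailing whitespace removed."""
    return [m.group(0).rstrip() for m in pattern.finditer(text)]

sample = (
    "Some random text here\n\n"
    "Dear Shareholders We\nare pleased to serve.\n"
    "Best regards \nDouglas\n\n"
    "Random text here as well"
)
print(extract_letters(sample))
print(extract_letters("intro\nTo our clients\nno closing here at all"))
```

Note that the 20-character check only applies to lines after the one containing the closing; the closing line itself is always kept in full.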

python

regex

text-extraction

text-mining