Regex searching for text on multiple lines

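I am trying to use a regular expression to extract a specific block of text that sits between two known phrases, and to discard everything else. The same two phrases will recur in other documents. The extracted sentences will then be passed to other functions.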

My problem seems to be that the regex works when the phrases I am searching for are on the same line. If they are on different lines, I get:

print(match.group(1).strip())
AttributeError: 'NoneType' object has no attribute 'group'

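I expect future reports to have line breaks in different places, depending on what was written before. Is there a way to prepare the text first by removing all line breaks, or to make my regex ignore them when searching?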

Any help would be great, thanks!

import fitz
import re

doc = fitz.open(r'file.pdf')
text_list = [ ]
for page in doc:
    text_list.append(page.get_text())  # get_text() replaces the deprecated getText() in newer PyMuPDF
    #print(text_list[-1])
text_string = ' '.join(text_list)
test_string = "Observations of Client Behavior: THIS IS THE DESIRED TEXT. Observations of Client's response to skill acquisition" #works for this test
pat = r".*?Observations of Client Behavior: (.*) Observations of Client's response to skill acquisition*"

match = re.search(pat, text_string)
print(match.group(1).strip())

When the phrases in pat are on the same line of the long text file, it works. As soon as they are on different lines, it no longer works.

Here is a sample of the input text that gives me trouble:

Observations of Client Behavior: Overall interfering behavior data trends are as followed: Aggression frequency 
has been low and stable at 0 occurrences for the past two consecutive sessions. Elopement frequency is on an 
overall decreasing trend. Property destruction frequency is on an overall decreasing trend. Non-compliance 
frequency has been stagnant at 2 occurrences for the past two consecutive sessions, but overall on a 
decreasing trend. Tantrum duration data are variable; data were at 89 minutes on 9/27/21, but have starkly 
decreased to 0 minutes for the past two consecutive sessions. Observations of Client's response to skill 
acquisition: Overall skill acquisition data trends are as followed: Frequency of excessive mands 

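Note that . matches any character except a newline, so you can use (.|\n) to capture everything. It also looks like a line break can fall inside your fixed phrases, so put \s+ between the words: it matches a space as well as a line break. First define the prefix and suffix of the pattern: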

prefix=r"Observations\s+of\s+Client\s+Behavior:"
suffix=r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"

Then build the pattern and find all matches:

pattern=prefix+r"((?:.|\n)*?)"+suffix
f=re.findall(pattern,text_string)

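By using *? at the end of r"((?:.|\n)*?)", we match as few characters as possible, so each match stops at the first suffix that follows its prefix.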
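As an alternative to (?:.|\n), the re.DOTALL flag makes . match newlines too. The following is a minimal sketch using the same prefix and suffix; the sample string is made up for illustration. It also shows how the lazy *? differs from a greedy *:

```python
import re

prefix = r"Observations\s+of\s+Client\s+Behavior:"
suffix = r"Observations\s+of\s+Client's\s+response\s+to\s+skill\s+acquisition:"

# Made-up sample with two blocks; the first one is broken across lines.
sample = (
    "Observations of Client Behavior: first\nblock Observations of Client's\n"
    "response to skill acquisition: x Observations of Client Behavior: second "
    "Observations of Client's response to skill acquisition: y"
)

# re.DOTALL lets "." match "\n", so "(.*?)" can span line breaks.
lazy = re.findall(prefix + r"(.*?)" + suffix, sample, flags=re.DOTALL)
# The greedy version runs on to the LAST suffix and merges both blocks.
greedy = re.findall(prefix + r"(.*)" + suffix, sample, flags=re.DOTALL)

print(lazy)         # one short capture per block
print(len(greedy))  # a single, overlong capture
```

With several blocks in one document, the lazy form is usually the one you want.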

Multi-line example with several matches:

text_string = '''any thing Observations of Client Behavior: patern1 Observations of Client's 
response to skill acquisition: any thing
any thing Observations of Client Behavior: patern2 Observations of 
Client's response to skill acquisition: any thing Observations of Client
Behavior: patern3 Observations of Client's response to skill acquisition: any thing any thing'''

result=re.findall(pattern,text_string)

result=[' patern1 ', ' patern2 ', ' patern3 ']

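The question also asks whether the line breaks can be removed from the text first. One option, sketched below on the assumption that the exact line-break positions do not matter to the later functions, is to collapse every run of whitespace to a single space before searching, and to check for None before calling .group(). The helper name extract_between is hypothetical and not part of re or fitz.

```python
import re

def extract_between(text, start, end):
    """Return the text between two literal phrases, or None if not found.

    Hypothetical helper, for illustration only.
    """
    # Collapse newlines, tabs and repeated spaces into single spaces.
    flat = re.sub(r"\s+", " ", text)
    # re.escape keeps any special characters in the phrases literal.
    pat = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pat, flat)
    # re.search returns None when nothing matches, so guard before .group().
    return match.group(1).strip() if match else None

# Made-up sample with line breaks between and inside the phrases.
sample = (
    "Observations of Client Behavior: Tantrum duration data are variable.\n"
    "Observations of Client's response to skill \nacquisition: Overall skill"
)
found = extract_between(
    sample,
    "Observations of Client Behavior:",
    "Observations of Client's response to skill acquisition:",
)
print(found)
```

Because the text is flattened first, the phrases can be written as plain strings without \s+ between the words. The None check avoids the AttributeError from the question when a report does not contain both phrases.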