How to extract text between two substrings from a file in Python

Question

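I want to read the text between two markers ("#*" and "#@") from a file. My file contains thousands of records in the format shown below. I tried the code below, but it does not return the desired output.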

import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in fhand:
    text = text.rstrip()
    print (line[line.find(start)+len(start):line.rfind(end)])
myfile.close()

My input:

#*OQL[C++]: Extending C++ with an Object Query Capability
#@José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1

My output:

51103
OQL[C++]: Extending C++ with an Object Query Capability

t199
cModern Database System
index
...

Expected output:

OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems

Answer 1

Use the following regular expression:

#\*([\s\S]*?)#@ /g

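This regular expression captures every character, whitespace or non-whitespace (line breaks included), between #* and #@. The lazy quantifier *? stops at the first #@ that follows each #*, and the /g flag asks for all matches rather than only the first.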
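In Python, this pattern could be applied with re.findall, which returns every match much as the /g flag does. The following is a minimal sketch, assuming the records are already held in a string; the `sample` variable is a stand-in built from the question's input and is not part of the original answer.

```python
import re

# Stand-in for the file contents, using the record format from the question.
sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "#@José A. Blakeley\n"
    "#t1995\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
    "#t1995\n"
)

# [\s\S] matches any character, newlines included, so no re.S flag is needed.
# The lazy *? stops at the first "#@" after each "#*".
pattern = re.compile(r"#\*([\s\S]*?)#@")

# With a single capture group, findall returns only the captured text.
titles = [t.strip() for t in pattern.findall(sample)]
print(titles)
```

The strip() call removes the line break that sits between the title and the following "#@" marker.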

Demo

Answer 2

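You are reading the file line by line, but your matches span multiple lines. You need to read the whole file in and process it with a regular expression that can match any character across line breaks: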

import re
start = '#*'
end = '#@'
# Escape special chars and build the pattern dynamically; the group captures only the text between the markers
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()                     # Read the whole file into a variable
    for match in re.findall(rx, contents, re.S): # Note re.S will make . match line breaks, too
        print(match.strip())                     # Process each match individually
See the regex demo.
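The approach above reads the whole file into memory. If, as in the sample input, every title sits on a single line that begins with "#*", a line-by-line scan may be enough. This is a different technique from the regex answer, sketched under that assumption; the helper name `iter_titles` is hypothetical, and io.StringIO stands in for an open file.

```python
import io

def iter_titles(lines, start="#*"):
    """Yield the text after `start` for every line that begins with it."""
    for line in lines:
        if line.startswith(start):
            # Drop the marker and the trailing line break.
            yield line[len(start):].strip()

# Stand-in for an open file; any iterable of lines works.
sample = io.StringIO(
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "#@José A. Blakeley\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index0\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
)

titles = list(iter_titles(sample))
print(titles)
```

With a real file, the same generator could be fed the handle returned by open('lorem.txt'). It would not work if a title wrapped onto a second line, which the regex approach handles.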

regex

text-manipulation

python-3.x