Python regex to findall across multiple lines
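I've been trying to solve this for the past week without making any progress, so I'd really appreciate some help.
I have 1000 files containing text like this: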
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
A few of the files look like this instead:
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
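I need to extract the uppercase address using a regular expression in Python.
Technically, this is a CSV file exported by a very old system. It can't really be used as a CSV, so I chose to extract the string by treating the file as plain text.
This is my current code. I have tried many other combinations as well, but haven't found one that works.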
location = re.findall(r'^Location:,,,(.*),,,,,,,,,,,,,\n$|^Location:,,,(.*)[\n.*]{1,2,3,4,5,6},,,,,,,,,,,,,', CSV, flags=re.DOTALL | re.MULTILINE)
Am I even close? Or is there a better way to solve this?
Thanks in advance for your help.
Here is one idea: you can use a simple loop to detect and extract the multi-line location data.
# Test data
TEXT = """,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
"""

PREFIX = "Location:,,,"  # 12 characters to skip at the start of a record
TAIL = "," * 13          # 13 trailing commas mark the end of a record

in_location = False
tmp_location = None


def extract_location(l):
    global in_location
    global tmp_location
    if l.startswith("Location:"):
        in_location = True
        tmp_location = []
        # Special case: the whole address is on one line
        if l.endswith(TAIL):
            print(l[len(PREFIX):-len(TAIL)])
            in_location = False
        else:
            tmp_location.append(l[len(PREFIX):])  # Don't need 'Location:,,,'
    else:
        if in_location:
            tmp_location.append(l)
            if l.endswith(TAIL):
                # The end of a multi-line record
                in_location = False
                res = " ".join(tmp_location)
                print(res[:-len(TAIL)])  # Remove trailing commas


def main():
    for line in TEXT.split("\n"):
        extract_location(line)


if __name__ == "__main__":
    main()
Assuming it is saved in a file named concept.py:
$ python3 concept.py
ADDRESS_HERE_THAT I WANT BUT IT CAN ALSO BE ACROSS, MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES, AND IT ENDS AS ABRUPTLY
ADDRESS,IS,IN,ONE,LINE
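The loop above keeps its state in module-level globals and prints results as it goes. As a rough sketch of the same idea, a generator can hold that state locally and yield each address, which makes it easier to reuse across many files. This assumes, as in the sample data, that every record starts with `Location:,,,` and ends with exactly 13 trailing commas. The names `iter_locations`, `PREFIX`, `TAIL`, and `sample` are illustrative only.

```python
PREFIX = "Location:,,,"  # assumed fixed record prefix (12 characters)
TAIL = "," * 13          # assumed end-of-record marker, as in the sample data


def iter_locations(lines):
    """Yield one address string per 'Location:' record."""
    parts = None  # None means we are not inside a record
    for line in lines:
        if line.startswith(PREFIX):
            parts = []
            line = line[len(PREFIX):]  # drop the prefix, keep the address text
        if parts is None:
            continue  # filler row outside any record
        if line.endswith(TAIL):
            parts.append(line[:-len(TAIL)])  # strip the trailing commas
            yield " ".join(parts)
            parts = None
        else:
            parts.append(line)


# Sample rows rebuilt from the question's data (filler rows have 16 commas).
sample = "\n".join([
    "," * 16,
    "Location:,,,ADDRESS_HERE_THAT I WANT",
    "BUT IT CAN ALSO BE ACROSS,",
    "MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,",
    "AND IT ENDS AS ABRUPTLY" + TAIL,
    "," * 16,
    "Location:,,,ADDRESS,IS,IN,ONE,LINE" + TAIL,
    "," * 16,
])

addresses = list(iter_locations(sample.splitlines()))
for address in addresses:
    print(address)
```

For real files, the same function could probably be fed an open file object with line endings stripped, for example `iter_locations(line.rstrip("\r\n") for line in f)`.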
Given the dummy data you provided:
s = ''',,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,
Location:,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,'''
You can use the following regular expression:
import re
matches = re.findall(r'Location:((?:[^,]*,){16})', s, flags=re.MULTILINE)
Here is what the matches look like:
>>> print('\n\n'.join(matches))
,,,ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS,
MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,
AND IT ENDS AS ABRUPTLY,,,,,,,,,,

,,,ADDRESS,IS,IN,ONE,LINE,,,,,,,,,
What to do next depends on what the commas mean in the original file. For example, you might want to replace them with spaces:
addrs = [match.replace(',', ' ').strip() for match in matches]
Which looks like this:
>>> print('\n\n'.join(addrs))
ADDRESS_HERE_THAT I WANT
BUT IT CAN ALSO BE ACROSS
MULTIPLE LINES  BUT NOT A SPECIFIC SET OF LINES
AND IT ENDS AS ABRUPTLY

ADDRESS IS IN ONE LINE
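If the commas inside the address should be kept, another option is to anchor on the run of 13 trailing commas rather than counting fields. The pattern in the question fails for two reasons: `{1,2,3,4,5,6}` is not a valid repetition count, so `re` treats it as literal text, and `[\n.*]` is a character class that matches a single newline, dot, or asterisk. The sketch below uses a lazy `.*?` with `re.DOTALL` instead. It assumes every record ends with exactly 13 commas at the end of a line and that lines end in `\n` only. The names `pattern`, `sample`, and `addresses` are illustrative.

```python
import re

TAIL = "," * 13  # assumed end-of-record marker, as in the sample data

# Sample rows rebuilt from the question's data (filler rows have 16 commas).
sample = "\n".join([
    "," * 16,
    "Location:,,,ADDRESS_HERE_THAT I WANT",
    "BUT IT CAN ALSO BE ACROSS,",
    "MULTIPLE LINES, BUT NOT A SPECIFIC SET OF LINES,",
    "AND IT ENDS AS ABRUPTLY" + TAIL,
    "," * 16,
    "Location:,,,ADDRESS,IS,IN,ONE,LINE" + TAIL,
    "," * 16,
])

# MULTILINE lets ^ and $ match at each line boundary.
# DOTALL lets the lazy .*? run across newlines until the first
# position where 13 commas are followed by the end of a line.
pattern = re.compile(r"^Location:,,,(.*?),{13}$", flags=re.DOTALL | re.MULTILINE)

# Join multi-line addresses into a single line each.
addresses = [raw.replace("\n", " ") for raw in pattern.findall(sample)]
for address in addresses:
    print(address)
```

Because there is a single capture group and no alternation, `findall` returns plain strings rather than tuples with empty slots.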