Regular expression to match string from Table having multiple rows
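I am developing a parser to extract strings from a Table that has multiple rows. Each row actually contains a multi-line string.
Here is the string from one of the rows, together with my attempt: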
import re

# Column gaps are two spaces wide: the pattern below needs at least
# two whitespace characters after the date.
str2 = '''02/03/20  Test String in line1  3431  1.50 hrs.
Test String in line2
Test String in line3
18/05/20  Test String in line4  1234  .50 hrs.
Test String in line5
Test String in line6
'''
search = r'(?P<Date>\d{2}/\d{2}/\d{2}\s+)\s+(?P<Description>\w.*)\s+(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+\s)'
matches = re.search(search, str2)
print("Date:", matches.group('Date'))
print("Description:", matches.group('Description'))
print("Code:", matches.group('Code'))
print("Hours:", matches.group('Hours'))
However, it extracts only the contents of the first line; the remaining lines are ignored. The output I get is:
Date: 02/03/20
Description: Test String in line1
Code: 3431
Hours: 1.50
Any idea how to make sure all the remaining lines are taken into account?
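Here is one way to capture multiple rows.
Use re.finditer() instead of re.search(), so that every match is returned rather than only the first one. The re.M flag is needed only when the pattern anchors on ^ or $, which this pattern does not.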
import re

# Column gaps are two spaces wide: the pattern below needs at least
# two whitespace characters after the date.
str2 = '''02/03/20  Test String in line1  3431  1.50 hrs.
03/04/20  Test String in line2  3211  1.20 hrs.
03/04/20  Test String in line3  1111  2.20 hrs.'''
search = r'(?P<Date>\d{2}/\d{2}/\d{2}\s+)\s+(?P<Description>\w.*)\s+(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+\s)'
matches = re.finditer(search, str2)
for match in matches:
    # groupdict() maps each named group to the text it captured
    print(match.groupdict())
    print("\n")
Output (blank lines omitted):
{'Date': '02/03/20 ', 'Description': 'Test String in line1 ', 'Code': '3431', 'Hours': '1.50 '}
{'Date': '03/04/20 ', 'Description': 'Test String in line2 ', 'Code': '3211', 'Hours': '1.20 '}
{'Date': '03/04/20 ', 'Description': 'Test String in line3 ', 'Code': '1111', 'Hours': '2.20 '}
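To see what re.M actually changes, here is a minimal sketch. The sample text and the variable names are made up for illustration and are not part of the question. It only shows that the flag affects where ^ can match, which matters as soon as a pattern is anchored to the start of a line.

```python
import re

# Hypothetical sample: three lines, each starting with a date.
text = "02/03/20 first\n03/04/20 second\n03/04/20 third"

# Without re.M, ^ matches only at the very start of the string.
without_flag = re.findall(r'^\d{2}/\d{2}/\d{2}', text)

# With re.M, ^ also matches immediately after every newline.
with_flag = re.findall(r'^\d{2}/\d{2}/\d{2}', text, re.M)

print(without_flag)  # ['02/03/20']
print(with_flag)     # ['02/03/20', '03/04/20', '03/04/20']
```

An unanchored pattern finds the same matches with or without the flag, so re.finditer() alone is enough for it.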
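Here is a solution using a regex that matches either your first line (essentially the same as your existing regex) or some words on a line (captured in the Description2 group). We iterate over the matches with re.finditer(), printing the previous match whenever we reach a new first line and appending to the description whenever we match a second, third, etc. line: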
import re

str2 = '''02/03/20 Test String in line1 3431 1.50 hrs.
Test String in line2
Test String in line3
18/05/20 Test String in line4 1234 .50 hrs.
Test String in line5
Test String in line6
22/05/20 Test String in line7 1852 3.60 hrs.
30/05/20 Test String in line8 4567 8 hrs.
'''
search = r'^\s*(?P<Date>\d{2}/\d{2}/\d{2})\s+(?P<Description1>\w.*)\s+(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+).*|(?P<Description2>\w.*?)\s*$'
matches = re.finditer(search, str2, re.M)
date = None
for m in matches:
    if m.group('Date') is not None:
        if date is not None:
            # new match, print out the previous one
            print("Date:", date)
            print("Description:", descr)
            print("Code:", code)
            print("Hours:", hours)
            print()
        date = m.group('Date')
        descr = m.group('Description1')
        code = m.group('Code')
        hours = m.group('Hours')
    else:
        # continuation line: append it to the current description
        descr = descr + '\n' + m.group('Description2')
# print out last match
print("Date:", date)
print("Description:", descr)
print("Code:", code)
print("Hours:", hours)
Output:
Date: 02/03/20
Description: Test String in line1
Test String in line2
Test String in line3
Code: 3431
Hours: 1.50

Date: 18/05/20
Description: Test String in line4
Test String in line5
Test String in line6
Code: 1234
Hours: .50

Date: 22/05/20
Description: Test String in line7
Code: 1852
Hours: 3.60

Date: 30/05/20
Description: Test String in line8
Code: 4567
Hours: 8
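If you would rather collect the rows than print them, the same loop can be wrapped in a function. This is only a sketch under assumptions: the names PATTERN, parse_rows, and sample are made up here, and the regex is the one from this answer written as a raw string. Continuation lines that appear before any dated line are skipped.

```python
import re

# Same pattern as in the answer, compiled once with re.M.
PATTERN = re.compile(
    r'^\s*(?P<Date>\d{2}/\d{2}/\d{2})\s+(?P<Description1>\w.*)\s+'
    r'(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+).*|(?P<Description2>\w.*?)\s*$',
    re.M,
)

def parse_rows(text):
    """Return one dict per dated row; continuation lines extend Description."""
    rows = []
    for m in PATTERN.finditer(text):
        if m.group('Date') is not None:
            # A dated line starts a new row.
            rows.append({
                'Date': m.group('Date'),
                'Description': m.group('Description1'),
                'Code': m.group('Code'),
                'Hours': m.group('Hours'),
            })
        elif rows:
            # Continuation line: append it to the most recent row.
            rows[-1]['Description'] += '\n' + m.group('Description2')
    return rows

sample = '''02/03/20 Test String in line1 3431 1.50 hrs.
Test String in line2
Test String in line3
18/05/20 Test String in line4 1234 .50 hrs.
'''
rows = parse_rows(sample)
print(len(rows))
print(rows[0]['Description'])
print(rows[1]['Hours'])
```

Returning a list of dicts keeps parsing separate from output, so the rows can be printed, written to CSV, or loaded into a table afterwards.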