Regular expression to match strings from a table having multiple rows

Question

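I am developing a parser to extract strings from a table that has multiple rows. Each row actually contains a multi-line string.

Here is a sample string from the table, followed by my code: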

import re

str2 = '''02/03/20        Test String in line1              3431           1.50 hrs.
                 Test String in line2
                 Test String in line3

18/05/20        Test String in line4              1234           .50 hrs.
                 Test String in line5
                 Test String in line6                

                 '''

# Raw string so \d and \s are passed to the regex engine unchanged.
search = r'(?P<Date>\d{2}/\d{2}/\d{2}\s+)\s+(?P<Description>\w.*)\s+(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+\s)'

matches = re.search(search, str2)

print("Date:", matches.group('Date'))
print("Description:", matches.group('Description'))
print("Code:", matches.group('Code'))
print("Hours:", matches.group('Hours'))

However, it only extracts the content of the first line; the remaining lines are ignored. This is the output I get:

Date: 02/03/20       
Description: Test String in line1            
Code: 3431
Hours: 1.50

Any idea how to make sure all the remaining lines are taken into account?

Answer 1

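Here is one way to capture multiple rows: use re.finditer() instead of re.search(), which stops at the first match. The re.M flag is only required when the pattern uses the ^ or $ anchors and they need to match at every line; the pattern below has no anchors, so the flag is not needed here.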

import re

str2 = '''02/03/20       Test String in line1              3431           1.50 hrs.
          03/04/20       Test String in line2              3211           1.20 hrs. 
          03/04/20       Test String in line3              1111           2.20 hrs.'''

# Raw string so \d and \s are passed to the regex engine unchanged.
search = r'(?P<Date>\d{2}/\d{2}/\d{2}\s+)\s+(?P<Description>\w.*)\s+(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+\s)'

# finditer yields every non-overlapping match, not just the first one.
matches = re.finditer(search, str2)
for match in matches:
    print(match.groupdict())
    print("\n")

Output:

{'Date': '02/03/20      ', 'Description': 'Test String in line1             ', 'Code': '3431', 'Hours': '1.50 '}


{'Date': '03/04/20      ', 'Description': 'Test String in line2             ', 'Code': '3211', 'Hours': '1.20 '}


{'Date': '03/04/20      ', 'Description': 'Test String in line3             ', 'Code': '1111', 'Hours': '2.20 '}

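To make the difference between re.search(), re.finditer() and the re.M flag concrete, here is a minimal sketch on a made-up three-line string (the variable names and the sample text are illustrative only, not taken from the question). It assumes nothing beyond the standard re module.

```python
import re

# Illustrative sample text: one "name number" pair per line.
text = "alpha 1\nbeta 2\ngamma 3"

# re.search stops at the first match; re.finditer yields every match.
first = re.search(r"(\w+) (\d)", text)
all_rows = [m.groups() for m in re.finditer(r"(\w+) (\d)", text)]

# re.M only changes the behaviour of the anchors ^ and $.
# Without it, ^ matches only at the start of the whole string and
# $ only at its end, so an anchored pattern finds nothing here.
anchored_no_flag = re.findall(r"^(\w+) (\d)$", text)
anchored_with_flag = re.findall(r"^(\w+) (\d)$", text, re.M)

print(first.groups())       # only the first row
print(all_rows)             # all three rows
print(anchored_no_flag)     # empty list
print(anchored_with_flag)   # all three rows
```

In other words, switching from re.search() to re.finditer() is what returns every row; re.M becomes relevant only once the pattern is anchored with ^ or $.
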
Answer 2

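Here is a solution that uses a regex which matches either your first line (essentially the same as your existing regex) or some words on a line by themselves (captured in the Description2 group). We iterate over the matches with re.finditer(), printing the previous record whenever we hit a new first line and appending to the description whenever we match a second, third, etc. line: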

import re

str2 = '''02/03/20        Test String in line1              3431           1.50 hrs.
                 Test String in line2
                 Test String in line3

18/05/20        Test String in line4              1234           .50 hrs.
                 Test String in line5
                 Test String in line6                

 22/05/20        Test String in line7              1852           3.60 hrs.
 30/05/20        Test String in line8              4567           8 hrs.
               '''


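# Raw string; the first alternative matches a full row, the second a continuation line.
search = r'^\s*(?P<Date>\d{2}/\d{2}/\d{2})\s+(?P<Description1>\w.*)\s+(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+).*|(?P<Description2>\w.*?)\s*$'

# re.M lets ^ and $ match at the start and end of every line.
matches = re.finditer(search, str2, re.M)
date = None
for m in matches:
    if m.group('Date') is not None: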
        if date is not None:
            # new match, print out the previous one
            print("Date:", date)
            print("Description:", descr)
            print("Code:", code)
            print("Hours:", hours)
            print()
        date = m.group('Date')
        descr = m.group('Description1')
        code = m.group('Code')
        hours = m.group('Hours')
    elif date is not None:
        # continuation line: append it to the current row's description
        descr = descr + '\n' + m.group('Description2')

# print out last match
print("Date:", date)
print("Description:", descr)
print("Code:", code)
print("Hours:", hours)

Output:

Date: 02/03/20
Description: Test String in line1             
Test String in line2
Test String in line3
Code: 3431
Hours: 1.50

Date: 18/05/20
Description: Test String in line4             
Test String in line5
Test String in line6
Code: 1234
Hours: .50

Date: 22/05/20
Description: Test String in line7             
Code: 1852
Hours: 3.60

Date: 30/05/20
Description: Test String in line8             
Code: 4567
Hours: 8
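
If the rows are needed as data rather than printed text, the same pattern can feed a list of dictionaries. The following is only a sketch of that variation: parse_rows and ROW_OR_CONTINUATION are hypothetical names introduced here, the description is stored as a list of lines with trailing whitespace stripped, and continuation lines that appear before any dated row are silently skipped. All three are assumptions, not part of the answer above.

```python
import re

# Same alternation as above: either a full row (date, description,
# code, hours) or a continuation line holding only description text.
ROW_OR_CONTINUATION = re.compile(
    r'^\s*(?P<Date>\d{2}/\d{2}/\d{2})\s+(?P<Description1>\w.*)\s+'
    r'(?P<Code>[0-9]+)\s+(?P<Hours>[0-9.-]+).*'
    r'|(?P<Description2>\w.*?)\s*$',
    re.M,
)


def parse_rows(text):
    """Hypothetical helper: return one dict per dated table row."""
    rows = []
    for m in ROW_OR_CONTINUATION.finditer(text):
        if m.group('Date') is not None:
            # A new dated row starts a new record.
            rows.append({
                'Date': m.group('Date'),
                'Description': [m.group('Description1').rstrip()],
                'Code': m.group('Code'),
                'Hours': m.group('Hours'),
            })
        elif rows:
            # Continuation line: attach it to the most recent record.
            rows[-1]['Description'].append(m.group('Description2'))
    return rows


sample = (
    "02/03/20        Test String in line1              3431           1.50 hrs.\n"
    "                 Test String in line2\n"
    "\n"
    "18/05/20        Test String in line4              1234           .50 hrs.\n"
)
rows = parse_rows(sample)
for row in rows:
    print(row)
```

Collecting records in a list avoids the separate "print the last match" step after the loop, and '\n'.join(row['Description']) rebuilds the multi-line description when it is needed.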
