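Python multi-line regex
I am using pdfplumber.page.extract_text() to extract text from a bank statement. The text appears to be extracted correctly, but I am having trouble using a regex to extract the date, type, description and amount. I can't figure out a clean way to capture the multi-line description. I would like to group the description text inside the gold box with the description text on the line before the gold box.
Regex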
re.findall(r'(\d{2}\/\d{2})\s*([\w ]*)([$\d.,]*)(\s{2})([$\d.,]*).*\s(?=\w*)', text)
Regex description
(\d{2}\/\d{2}) - Capture date
([\w ]*) - Capture description
([$\d.,]*) - Capture expense amount
([$\d.,]*) - Capture deposit amount
(?=\w*) - Positive Lookahead for any text below
Input
0 0 ,345.67
08/27 DEBIT CARD PURCHASE XXXXXX 5541XXXXXX .23 0 3,456.78
RACETRAC467 00004671 PLEASANTVILLEPA
08/27 BANK FUNDS TRANSFER DB .67 0 4,816.32
TO SMITH,JOHN
SAVINGS #0001, CONF# 8675309
continued on next page>>>
987654-3210
Page 1 of 11
Current output
['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX ', '.23', ' ', '0', ' 3,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB ', '.67', ' ', '0', ' 4,816.32 ']
Desired output
['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA ', '.23', ' ', '0', ' 3,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309 ', '.67', ' ', '0', ' 4,816.32 ']
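You can append the description from the lines that follow (that is, lines that do not start with a date, with "continued", or with "Page" and a digit) to the description you already have.
In your pattern you use [\w ]* but that can also match only spaces, or nothing at all. If there should be at least one word character, you can use \w[\w ]*
You can also omit the capture group in this part, (\s{2}), as it would return an entry that contains only whitespace.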
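The two points above can be checked in isolation. This is only a small sketch on made-up strings (`only_spaces` and the `"1  2"` input are illustrative, not taken from the statement):

```python
import re

# A string containing only spaces, standing in for an empty description.
only_spaces = "   "

# [\w ]* accepts spaces alone (and even the empty string).
loose = re.fullmatch(r"[\w ]*", only_spaces)

# \w[\w ]* requires the first character to be a word character.
strict = re.fullmatch(r"\w[\w ]*", only_spaces)

print(loose is not None)   # True
print(strict is not None)  # False

# A capture group around the separator adds a whitespace-only entry
# to every tuple that re.findall returns.
with_group = re.findall(r"(\d)(\s{2})(\d)", "1  2")
without_group = re.findall(r"(\d)\s{2}(\d)", "1  2")

print(with_group)     # [('1', '  ', '2')]
print(without_group)  # [('1', '2')]
```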
(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+\/\d|continued\b|Page\s+\d).*)*)
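The pattern matches:
(?P<date>\d{2}/\d{2}) - Group date
\s+ - Match 1+ whitespace characters
(?P<desc>\w[\w ]*) - Group desc, match a word character followed by optional word characters and spaces
(?P<expense>\$[\d.,]*) - Group expense, match $ followed by optional digits, . or ,
\s{2} - Match 2 whitespace characters
(?P<deposit>\d[\d.,]*) - Group deposit, match a digit followed by optional digits, . or ,
\s.* - Match a single whitespace character and the rest of the line
(?P<desc_more> - Group desc_more
(?: - Non-capture group, to be repeated as a whole
\n(?!\d+\/\d|continued\b|Page\s+\d).* - Match a newline and, if what follows does not start with a date-like pattern or one of the other alternatives, match the rest of that line
)* - Close the non-capture group and optionally repeat it
) - Close group desc_more
See a regex demo and a Python demo.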
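To see how the continuation part behaves on its own, here is a minimal sketch that applies only the repeated non-capture group to a few lines. The `text` value and the names `continuation` and `extra_lines` are made up for illustration:

```python
import re

# Only the continuation part of the pattern: zero or more following lines
# that do not start with a date, "continued", or "Page <digit>".
continuation = re.compile(r"(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*")

text = (
    "first line\n"
    "TO SMITH,JOHN\n"
    "SAVINGS #0001, CONF# 8675309\n"
    "continued on next page>>>\n"
    "Page 1 of 11"
)

# Start matching right after "first line", i.e. at its trailing newline.
m = continuation.match(text, len("first line"))

# Drop the empty string that precedes the first newline.
extra_lines = m.group().split("\n")[1:]
print(extra_lines)  # ['TO SMITH,JOHN', 'SAVINGS #0001, CONF# 8675309']
```

The negative lookahead is checked right after each newline, so the repetition stops at the first line that starts with one of the excluded alternatives and never consumes it.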
Example using named capture groups and match.groupdict():
import re

# Same pattern, except that "\$?" and "\s+" are used for the expense group and the
# separator, so it also matches the sample text below, which has no "$" signs or
# doubled spaces.
pattern = r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$?[\d.,]*)\s+(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+\/\d|continued\b|Page\s+\d).*)*)"

s = (" 0 0 ,345.67 \n"
     "08/27 DEBIT CARD PURCHASE XXXXXX 5541XXXXXX .23 0 3,456.78\n"
     "RACETRAC467 00004671 PLEASANTVILLEPA\n"
     "08/27 BANK FUNDS TRANSFER DB .67 0 4,816.32\n"
     "TO SMITH,JOHN\n"
     "SAVINGS #0001, CONF# 8675309\n"
     "continued on next page>>>\n"
     " 987654-3210\n"
     "Page 1 of 11\n"
     "07/27 DEBIT CARD PURCHASE XXXXXX 6541XXXXXX .23 0 3,456.78")

for match in re.finditer(pattern, s):
    d = match.groupdict()
    # Append the continuation lines to the description, turning each line break
    # (and any spaces or tabs before it) into a single space.
    d["desc"] = re.sub(r"[^\S\n]*\n", " ", d["desc"] + d.pop("desc_more"))
    print(d)
Output
{'date': '08/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA', 'expense': '.23', 'deposit': '0'}
{'date': '08/27', 'desc': 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309', 'expense': '.67', 'deposit': '0'}
{'date': '07/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 6541XXXXXX ', 'expense': '.23', 'deposit': '0'}
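The last entry above still ends in a trailing space. If that matters, the loop could be wrapped in a small function that also normalises whitespace. This is only a sketch: `parse_transactions` and `TRANSACTION` are hypothetical names, and the pattern assumes an optional `$` and any run of whitespace between the two amounts so that it matches the sample text as shown here.

```python
import re
from typing import Dict, List

# Assumption: same structure as the pattern above, but the "$" is optional
# and any run of whitespace may separate the two amounts.
TRANSACTION = re.compile(
    r"(?P<date>\d{2}/\d{2})\s+"
    r"(?P<desc>\w[\w ]*)"
    r"(?P<expense>\$?[\d.,]*)\s+"
    r"(?P<deposit>\d[\d.,]*)\s.*"
    r"(?P<desc_more>(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*)"
)


def parse_transactions(text: str) -> List[Dict[str, str]]:
    """Return one dict per transaction, with continuation lines merged into 'desc'."""
    rows = []
    for match in TRANSACTION.finditer(text):
        row = match.groupdict()
        # Join the first description line with the lines below it, then
        # collapse every run of whitespace (newlines included) to one space.
        full_desc = row["desc"] + row.pop("desc_more")
        row["desc"] = " ".join(full_desc.split())
        rows.append(row)
    return rows


sample = (
    "08/27 BANK FUNDS TRANSFER DB .67 0 4,816.32\n"
    "TO SMITH,JOHN\n"
    "SAVINGS #0001, CONF# 8675309\n"
    "continued on next page>>>\n"
    "Page 1 of 11\n"
)
print(parse_transactions(sample))
```

Note that with the `$` optional, a leading digit of an amount can be absorbed into the description (for example `DB 1.23` would give a description ending in `1`), so keeping `\$` mandatory is the safer choice when the extracted text contains the dollar signs.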