Python multi-line regex

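I am using pdfplumber.page.extract_text() to extract text from a bank statement. The text appears to be extracted correctly, but I am having trouble using a regex to pull out the date, type, description, and amounts. In particular, I can't think of a clean way to capture the multi-line descriptions. I would like the description text inside each gold box to be grouped with the description text on the line just before that gold box.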

Regex

re.findall(r'(\d{2}\/\d{2})\s*([\w ]*)([$\d.,]*)(\s{2})([$\d.,]*).*\s(?=\w*)', text)

Regex description

(\d{2}\/\d{2}) - Capture date
([\w ]*) - Capture description
([$\d.,]*) - Capture expense amount
([$\d.,]*) - Capture deposit amount
(?=\w*) - Positive Lookahead for any text below

Input

  0  0  ,345.67 
08/27  DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  .23  0  3,456.78
RACETRAC467 00004671 PLEASANTVILLEPA
08/27  BANK FUNDS TRANSFER DB  .67  0  4,816.32
TO SMITH,JOHN
SAVINGS #0001, CONF# 8675309
continued on next page>>>
 987654-3210
Page 1 of 11

Current output

['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  ', '.23', '  ', '0', '  3,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB  ', '.67', '  ', '0', '  4,816.32 ']

Desired output

['08/27', 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA  ', '.23', '  ', '0', '  3,456.78 ']
['08/27', 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309 ', '.67', '  ', '0', ' 4,816.32 ']

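You can append the description from the following lines (that is, lines that do not start with a date, with "continued", or with "Page" and a number) to the description you already have.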

In your pattern you use [\w ]*, but that can also match only spaces. If there should be at least one word character, you can use \w[\w ]* instead.
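
The difference can be checked directly. This is a minimal sketch; the names loose, strict, and only_spaces are made up for illustration:

```python
import re

loose = re.compile(r"[\w ]*")     # character class from the question
strict = re.compile(r"\w[\w ]*")  # requires a leading word character

only_spaces = "   "  # hypothetical input with no word characters

print(bool(loose.fullmatch(only_spaces)))   # the loose class accepts it
print(bool(strict.fullmatch(only_spaces)))  # the strict version rejects it
print(strict.fullmatch("BANK FUNDS TRANSFER DB").group())
```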

You can also drop the capture group around this part, (\s{2}), because it only returns an entry containing whitespace.
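
To see the effect of that group in isolation, here is a small sketch using one line of the sample input and a cut-down pattern that covers only the two amount columns (the cut-down pattern is an illustration, not the full regex):

```python
import re

line = "08/27  BANK FUNDS TRANSFER DB  .67  0  4,816.32"

# With (\s{2}) captured, findall returns a whitespace-only field.
with_group = re.findall(r"([\d.,]+)(\s{2})(\d[\d.,]*)", line)

# Without the parentheses the two spaces are still required, but not returned.
without_group = re.findall(r"([\d.,]+)\s{2}(\d[\d.,]*)", line)

print(with_group)
print(without_group)
```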

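(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$?[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+\/\d|continued\b|Page\s+\d).*)*)

The pattern matches:

  • (?P<date>\d{2}/\d{2}) Group date: match two digits, a slash, and two digits
  • \s+ Match 1+ whitespace characters
  • (?P<desc>\w[\w ]*) Group desc: match a word character followed by optional word characters and spaces
  • (?P<expense>\$?[\d.,]*) Group expense: match an optional literal $ (escaped, because a bare $ is an end-of-string anchor) followed by optional digits, dots, or commas; the $ is optional because the amounts in the sample input have no leading $
  • \s{2} Match 2 whitespace characters
  • (?P<deposit>\d[\d.,]*) Group deposit: match a digit followed by optional digits, dots, or commas
  • \s.* Match a single whitespace character and the rest of the line
  • (?P<desc_more> Group desc_more
    • (?: Non-capturing group, to match as a whole
      • \n(?!\d+\/\d|continued\b|Page\s+\d).* Match a newline and, if what follows does not start with a date-like pattern or one of the other alternatives, match the rest of that line
    • )* Close the non-capturing group and optionally repeat it
  • ) Close group desc_more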
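
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace outside character classes and allows inline comments. This is a sketch of the same idea, not a different regex; the name TRANSACTION is made up, and the escaped, optional \$ is an assumption so that amounts with or without a leading dollar sign match:

```python
import re

TRANSACTION = re.compile(r"""
    (?P<date>\d{2}/\d{2})      # date such as 08/27
    \s+
    (?P<desc>\w[\w ]*)         # first line of the description
    (?P<expense>\$?[\d.,]*)    # expense amount, dollar sign optional (assumption)
    \s{2}
    (?P<deposit>\d[\d.,]*)     # deposit amount
    \s.*                       # rest of the line
    (?P<desc_more>             # zero or more continuation lines
        (?:
            \n
            (?!\d+/\d|continued\b|Page\s+\d)   # stop at a date, continued, or Page N
            .*
        )*
    )
""", re.VERBOSE)

text = ("08/27  BANK FUNDS TRANSFER DB  .67  0  4,816.32\n"
        "TO SMITH,JOHN\n"
        "SAVINGS #0001, CONF# 8675309\n"
        "continued on next page>>>\n")

m = TRANSACTION.search(text)
continuation = m.group("desc_more").strip().splitlines()
print(continuation)
```

The negative lookahead is what ends the record: the "continued on next page" line is never consumed, so it cannot leak into the description.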

Example using named capture groups and match.groupdict():

import re

# \$ is escaped (a bare $ is an end-of-string anchor) and made optional,
# because the amounts in the sample text below have no leading "$".
pattern = r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$?[\d.,]*)\s{2}(?P<deposit>\d[\d.,]*)\s.*(?P<desc_more>(?:\n(?!\d+\/\d|continued\b|Page\s+\d).*)*)"

s = ("  0  0  ,345.67 \n"
     "08/27  DEBIT CARD PURCHASE XXXXXX 5541XXXXXX  .23  0  3,456.78\n"
     "RACETRAC467 00004671 PLEASANTVILLEPA\n"
     "08/27  BANK FUNDS TRANSFER DB  .67  0  4,816.32\n"
     "TO SMITH,JOHN\n"
     "SAVINGS #0001, CONF# 8675309\n"
     "continued on next page>>>\n"
     " 987654-3210\n"
     "Page 1 of 11\n"
     "07/27  DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  .23  0  3,456.78")

for match in re.finditer(pattern, s):
    d = match.groupdict()
    # Join the first description line with its continuation lines: each
    # newline, plus any spaces right before it, becomes a single space.
    d["desc"] = re.sub(r"[^\S\n]*\n", " ", d["desc"] + d.pop("desc_more"))
    print(d)

Output

{'date': '08/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 5541XXXXXX RACETRAC467 00004671 PLEASANTVILLEPA', 'expense': '.23', 'deposit': '0'}
{'date': '08/27', 'desc': 'BANK FUNDS TRANSFER DB TO SMITH,JOHN SAVINGS #0001, CONF# 8675309', 'expense': '.67', 'deposit': '0'}
{'date': '07/27', 'desc': 'DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  ', 'expense': '.23', 'deposit': '0'}
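
Note that the last record keeps two trailing spaces in desc, because the substitution only touches whitespace that precedes a newline. If that matters, the loop can be wrapped in a small helper that normalizes whitespace. The sketch below is one way to do it; parse_transactions and PATTERN are hypothetical names, and the escaped, optional \$ is an assumption so that amounts with or without a leading dollar sign match:

```python
import re

# Pattern from the answer, with the dollar sign escaped and optional (assumption).
PATTERN = re.compile(
    r"(?P<date>\d{2}/\d{2})\s+(?P<desc>\w[\w ]*)(?P<expense>\$?[\d.,]*)"
    r"\s{2}(?P<deposit>\d[\d.,]*)\s.*"
    r"(?P<desc_more>(?:\n(?!\d+/\d|continued\b|Page\s+\d).*)*)"
)

def parse_transactions(text):
    """Return one dict per transaction with a single-line description."""
    rows = []
    for match in PATTERN.finditer(text):
        row = match.groupdict()
        more = row.pop("desc_more")
        # split() with no arguments collapses every run of whitespace,
        # newlines included, and drops leading and trailing whitespace.
        row["desc"] = " ".join((row["desc"] + more).split())
        rows.append(row)
    return rows

sample = "07/27  DEBIT CARD PURCHASE XXXXXX 6541XXXXXX  .23  0  3,456.78"
print(parse_transactions(sample))
```

Unlike the substitution above, this also collapses repeated spaces inside the description, which may or may not be what you want.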