Regex from a list of strings in Python
I have a list named Statement, created from a PDF using pytesseract and regex:
Statement= ['07-10-2019 UPI/927912685773/UPI/surya.balaji94@/Citibank 6,677.00 2,36,804.08',
'07-10-2019 MOBILE BANKING DUT/CITIBANK 3,403.00 2,40,207.08',
'07-10-2019 BIL/INFT/001818195728/82D3/ AJAY KUMAR JHA 6,080.00 2,46, 287.08',
'08.10.2019 MOBILE BANKING MMM TiMPS/928115182374/8161 Oct Mte/AMARJEET SIHDFC 4,411.00 250,698.08',
'08-10-2019 BIL/INFT/001818636132/E3 BIk1 Pramod/ PRAMOD KUMAR P 6,599.00 2,57,297.08']
With some help from Stack Overflow, I created a list of dictionaries as follows:
import re
from collections import namedtuple

cols = ["Date", "Item_Name", "Transaction_Amount", "Balance"]
date_pattern = re.compile(r"\d{2}[- /.]\d{2}[- /.]\d{4}", re.I)
item_and_name_pattern = re.compile(r"(?<=\d{2}-\d{2}-\d{4}\s).*", re.I)
amount_pattern = re.compile(r"\d+,\d+.\d+", re.I)
total_pattern = re.compile(r"\d+,\d+,\d+.\d+$", re.I)
Transaction = namedtuple("Transaction", cols)
transactions = []
for item in Statement:
    try:
        date = re.search(date_pattern, item).group()
        total = re.search(total_pattern, item).group()
        temp_1 = item.rstrip(total)
        amount = re.search(amount_pattern, item).group()
        temp_2 = temp_1.strip().rstrip(amount)
        item_and_name = re.search(item_and_name_pattern, temp_2).group()
    except:
        pass
    t = Transaction(date, item_and_name, amount, total)
    transactions.append(t)
out = [{k:v for k, v in f._asdict().items()} for f in transactions]
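But the output is not satisfactory: it picks up each date, but the item name, total, and so on are wrong for some of those dates (compare the list above with the dictionaries below). Is there another way to store them correctly in named columns?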
[{'Date': '07-10-2019',
'Item_Name': 'UPI/927912685773/UPI/surya.balaji94@/Citibank ',
'Transaction_Amount': '6,677.00',
'Balance': '2,36,804.08'},
{'Date': '07-10-2019',
'Item_Name': 'MOBILE BANKING DUT/CITIBANK ',
'Transaction_Amount': '3,403.00',
'Balance': '2,40,207.08'},
{'Date': '07-10-2019',
'Item_Name': 'MOBILE BANKING DUT/CITIBANK ',
'Transaction_Amount': '3,403.00',
'Balance': '2,40,207.08'},
{'Date': '08.10.2019',
'Item_Name': 'MOBILE BANKING DUT/CITIBANK ',
'Transaction_Amount': '3,403.00',
'Balance': '2,40,207.08'},
{'Date': '08-10-2019',
'Item_Name': 'BIL/INFT/001818636132/E3 BIk1 Pramod/ PRAMOD KUMAR P ',
'Transaction_Amount': '6,599.00',
'Balance': '2,57,297.08'}]
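A likely reason for the repeated rows: when one of the `re.search` calls returns `None`, `.group()` raises `AttributeError`, the bare `except` swallows it, and the variables still hold the values from the previous line. The sketch below reproduces that with `total_pattern` alone on two of the lines from the question; the loop is simplified for illustration and is not the original code. Separately, note that `str.rstrip(total)` strips a set of characters, not a suffix.

```python
import re

total_pattern = re.compile(r"\d+,\d+,\d+.\d+$")

lines = [
    "07-10-2019 MOBILE BANKING DUT/CITIBANK 3,403.00 2,40,207.08",
    "08.10.2019 MOBILE BANKING MMM TiMPS/928115182374/8161 Oct Mte/AMARJEET SIHDFC 4,411.00 250,698.08",
]

totals = []
for line in lines:
    try:
        total = total_pattern.search(line).group()
    except AttributeError:
        # search() returned None because '250,698.08' has only one comma;
        # `total` still holds the value from the previous iteration.
        pass
    totals.append(total)

print(totals)  # ['2,40,207.08', '2,40,207.08']
```

The second entry is the stale balance from the first line, which matches the duplicated dictionaries in the output above.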
Here is a simpler approach:
import re
pattern = re.compile(r"(?P<Date>\d{2}[.-]\d{2}[.-]\d{4})\s(?P<Item_Name>.+)\s(?P<Transaction_Amount>[0-9,\.]+)\s(?P<Balance>[0-9,\.]+)")
print([pattern.match(item).groupdict() for item in Statement])
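To see what the named groups capture, here is a small self-contained run on a single line from the question. The pattern is the same one, only split across several raw-string literals for readability; the variable names `line` and `row` are just for this illustration.

```python
import re

# Same pattern as in the answer, split over several raw-string pieces.
pattern = re.compile(
    r"(?P<Date>\d{2}[.-]\d{2}[.-]\d{4})\s"
    r"(?P<Item_Name>.+)\s"
    r"(?P<Transaction_Amount>[0-9,.]+)\s"
    r"(?P<Balance>[0-9,.]+)"
)

line = "07-10-2019 MOBILE BANKING DUT/CITIBANK 3,403.00 2,40,207.08"
row = pattern.match(line).groupdict()
print(row)
# {'Date': '07-10-2019', 'Item_Name': 'MOBILE BANKING DUT/CITIBANK',
#  'Transaction_Amount': '3,403.00', 'Balance': '2,40,207.08'}
```

Because `.+` is greedy, the item name absorbs everything up to the last two space-separated numeric tokens, so each key in the dictionary comes from the same line.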
Edit: if you want to use try-except, as requested in the comments:
result = []
for item in Statement:
    try:
        result.append(pattern.match(item).groupdict())
    except AttributeError:
        pass
print(result)
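One thing to watch with the pattern above: the third line of Statement contains an OCR glitch (`2,46, 287.08`), so the pattern still matches it but splits the balance across the last two groups. A possible variant, sketched below, tolerates a stray space after a comma in the balance, checks for `None` instead of catching `AttributeError`, and keeps the lines it could not parse. The names `strict_pattern`, `to_decimal`, and `parse_lines` are hypothetical, the assumption that amounts always have two decimal places is mine, and the `"page 3 of 7"` line is made-up sample input.

```python
import re
from decimal import Decimal

# Hypothetical stricter variant: amounts must end in two decimals, and the
# balance may contain one whitespace character after a comma (OCR noise).
strict_pattern = re.compile(
    r"(?P<Date>\d{2}[.-]\d{2}[.-]\d{4})\s+"
    r"(?P<Item_Name>.+?)\s+"
    r"(?P<Transaction_Amount>\d+(?:,\d+)*\.\d{2})\s+"
    r"(?P<Balance>\d+(?:,\s?\d+)*\.\d{2})"
)


def to_decimal(text):
    """Drop commas and stray spaces, then parse the rest as a Decimal."""
    return Decimal(text.replace(",", "").replace(" ", ""))


def parse_lines(lines):
    """Return (parsed rows, lines that did not match)."""
    rows, skipped = [], []
    for line in lines:
        m = strict_pattern.fullmatch(line.strip())
        if m is None:
            skipped.append(line)  # keep it for inspection instead of dropping it
            continue
        row = m.groupdict()
        row["Transaction_Amount"] = to_decimal(row["Transaction_Amount"])
        row["Balance"] = to_decimal(row["Balance"])
        rows.append(row)
    return rows, skipped


sample = [
    "07-10-2019 BIL/INFT/001818195728/82D3/ AJAY KUMAR JHA 6,080.00 2,46, 287.08",
    "page 3 of 7",  # made-up non-transaction line
]
rows, skipped = parse_lines(sample)
print(rows[0]["Balance"])  # 246287.08
print(skipped)             # ['page 3 of 7']
```

Returning the skipped lines makes it easier to spot OCR errors that the pattern does not cover, rather than losing those rows silently.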