How to extract specific information from a multi-line string
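I have extracted some invoice-related information from an email body into a Python string. My next task is to extract the invoice numbers from that string.
The email format can vary, so it is hard to locate the invoice number in the text. I also tried SpaCy's "Named Entity Recognition", but because in most cases the invoice number appears on the line below the heading 'Invoice' or 'Invoice#', NER does not understand the relationship and returns incorrect details.
Below are 2 samples of the text extracted from email bodies:
Example 1: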
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example 2:
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 ,579.06 29-Jan-19 28-Apr-19
9872341 ,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert the whole text into a single string, it becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
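As you can see, the invoice number (8754321 in this case) has changed position and no longer follows the keyword "Invoice", which makes it harder to find.
The output I want looks like this: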
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
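I don't know how to retrieve the text under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if more information is needed. Thanks!!
Edit: the invoice number has no pre-defined length; it can be 7 digits or more.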
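Going off what Andrew Allen said, as long as these two assumptions hold:
- The invoice number is always exactly 7 digits
- The invoice number is always preceded by whitespace and followed by whitespace
a regular expression should work. Something like: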
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
# Capture every run of exactly 7 digits that has whitespace on both sides
invoices = re.findall(r'\s(\d{7})\s', email)
print(invoices)
In this case, invoices is a list of 2 strings: ['8754321', '5245344']
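The question's edit says the number can be 7 digits or more, so the exact-length assumption above may be too strict. Here is a rough sketch of one way to relax it, assuming the invoice number is always the first token on its line (true for both samples, but not guaranteed for every email). The function name leading_invoice_numbers and the min_digits parameter are made up for illustration.

```python
import re

def leading_invoice_numbers(text, min_digits=7):
    """Return digit runs of at least min_digits that start a line."""
    # re.MULTILINE makes ^ match at the start of every line, so only the
    # first column is considered and later numeric columns are ignored.
    pattern = re.compile(r"^[ \t]*(\d{%d,})\b" % min_digits, flags=re.MULTILINE)
    return pattern.findall(text)

email = """Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices."""

print(leading_invoice_numbers(email))  # ['8754321', '5245344']
```

Because the match is anchored to the start of the line, the 10-digit purchase order numbers are not returned even though they satisfy the length check.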
Use a regular expression with re.findall.
For example:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 ,579.06 29-Jan-19 28-Apr-19
9872341 ,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    # re.DOTALL has no effect here because the pattern contains no '.'
    print(re.findall(r"\b\d{7}\b", eml, flags=re.DOTALL))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - word boundary in a regular expression
\d{7} - matches exactly 7 digits
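To see why the boundaries matter, here is a small comparison on one row of the first sample. It only uses re.findall from the standard library; the variable names are just for illustration.

```python
import re

row = "8754321 8/17/17 7200016508 9/16/18 140.72"

# Exactly 7 digits between word boundaries: the 10-digit PO number is skipped.
exact = re.findall(r"\b\d{7}\b", row)
print(exact)   # ['8754321']

# Without the boundaries, a 7-digit slice of the longer number also matches.
loose = re.findall(r"\d{7}", row)
print(loose)   # ['8754321', '7200016']

# Allowing 7 or more digits picks up the purchase order number as well.
longer = re.findall(r"\b\d{7,}\b", row)
print(longer)  # ['8754321', '7200016508']
```

So if invoice numbers can be longer than 7 digits, widening the quantifier alone is not enough; something else (such as the position in the row) has to separate them from other long numbers.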
Code based on my comment.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
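This uses a heuristic: every column name in the header row is capitalized (camel case) or all upper case (e.g. ID). It fails if, say, a header is exactly 'Account no.' rather than 'Account No.'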
# Get every number that starts at that index
for line in email.split('\n'):
    words = line[index:].split()
    if not words:
        continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        # The first word at that position is not a number; skip the line
        continue
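Reliability here depends on the data. In my code, the invoice column must be the first column of the table header; that is, 'Invoice Date' must not appear before 'Invoice'. Obviously this needs fixing.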
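For what it's worth, the two loops can be folded into one function that also accepts 'Invoice#' and numbers of any length. This is only a sketch and it keeps the same limitation: it assumes the invoice column is the first column and that the table rows directly follow the header line. The names HEADER, NUMBER and invoices_under_header are invented for this example.

```python
import re

# Header line whose first column is 'Invoice' or 'Invoice#' (case-insensitive).
HEADER = re.compile(r"^\s*Invoice\s*#?(\s|$)", flags=re.IGNORECASE)
# A row whose first token consists of digits only.
NUMBER = re.compile(r"^\s*(\d+)(?=\s|$)")

def invoices_under_header(text):
    """Collect the leading number of each row that follows an 'Invoice' header."""
    numbers = []
    in_table = False
    for line in text.splitlines():
        if HEADER.match(line):
            in_table = True       # header found; rows are expected next
            continue
        if in_table:
            match = NUMBER.match(line)
            if match:
                numbers.append(match.group(1))
            elif line.strip():
                in_table = False  # first non-empty, non-numeric line ends the table
    return numbers

email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 ,579.06 29-Jan-19 28-Apr-19
9872341 ,137.20 27-Feb-19 26-Apr-19 """

print(invoices_under_header(email2))  # ['7651234', '9872341']
```

Working line by line keeps the header and the rows below it together, which is the relationship that gets lost once the text is flattened into a single string.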