How to extract specific information from a multi-line string

Question

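I have extracted some invoice-related information from an email body into a Python string. My next task is to extract the invoice numbers from that string. The format of the emails can vary, so it is difficult to locate the invoice number in the text. I also tried spaCy's "Named Entity Recognition", but since in most cases the invoice number appears on the line after the heading 'Invoice' or 'Invoice#', NER does not understand the relationship and returns incorrect details.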

Below are 2 samples of the text extracted from the email bodies:

Example - 1.

Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.

Example - 2.

Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date          
7651234                ,579.06          29-Jan-19           28-Apr-19            
9872341                ,137.20          27-Feb-19           26-Apr-19

My problem is that if I convert the entire text into a single string, it becomes something like this:

Invoice   Date     Purchase Order  Due Date  Balance 8754321   8/17/17 
7200016508     9/16/18   140.72

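As you can see, the invoice number (8754321 in this case) has changed position and no longer follows the keyword "Invoice", which makes it harder to find.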

The output I want is something like this:

Output Example - 1 - 

8754321
5245344

Output Example - 2 - 

7651234                
9872341

I don't know how to retrieve the text under the keyword "Invoice" or "Invoice#", which is the invoice number.

Please let me know if more information is needed. Thanks!!

Edit: The invoice number does not have any pre-defined length; it can be 7 digits or more.

Answer 1

Going off what Andrew Allen said, as long as these two assumptions hold:

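The invoice number is always exactly 7 digits
The invoice number is always preceded by a whitespace character and followed by a whitespace character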

Using a regular expression should work. Something like:

import re

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)

invoices in this case is a list of 2 strings, ['8754321', '5245344']
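
One caveat with the pattern above, shown here as a small sketch: each `\s` consumes a real character, so a number at the very start or end of the string, or two numbers separated by a single space, can be missed. Lookarounds are one possible way around that. The sample string `text` below is made up for illustration and is not taken from the question's emails.

```python
import re

# Hypothetical sample: two 7-digit numbers separated by a single space,
# with no whitespace before the first or after the last.
text = "8754321 5245344"

# The whitespace-delimited pattern needs a real whitespace character on both sides.
strict = re.findall(r"\s(\d{7})\s", text)

# Lookarounds only assert that no non-whitespace character is adjacent,
# so string boundaries and shared separators are fine.
relaxed = re.findall(r"(?<!\S)\d{7}(?!\S)", text)

print(strict)   # []
print(relaxed)  # ['8754321', '5245344']
```

On the emails from the question both patterns give the same result, because every invoice number there has whitespace on both sides.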

Answer 2

Use a regular expression with re.findall.

For example:

import re

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

email2 = """Hi - please confirm the status of below two invoices.

Invoice#               Amount               Invoice Date       Due Date          
7651234                ,579.06          29-Jan-19           28-Apr-19            
9872341                ,137.20          27-Feb-19           26-Apr-19 """

for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml))

Output:

['8754321', '5245344']
['7651234', '9872341']

\b - regex word boundary
\d{7} - matches exactly 7 digits
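
Since the question's edit says the number can be 7 digits or more, a fixed `\d{7}` may be too strict, while a plain `\d{7,}` would also pick up the 10-digit purchase order numbers. A minimal sketch of one way to handle this, assuming the invoice number is always the first field on its line (an assumption that holds for both sample emails but may not hold in general):

```python
import re

email = """Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54"""

# ^ with re.MULTILINE anchors at the start of every line;
# [ \t]* allows leading indentation; \d{7,} accepts 7 or more digits.
pattern = re.compile(r"^[ \t]*(\d{7,})\b", flags=re.MULTILINE)

invoice_numbers = pattern.findall(email)
print(invoice_numbers)  # ['8754321', '5245344']
```

The purchase order numbers are skipped only because they do not start a line, so this relies on the line structure of the email being preserved.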

Answer 3

Code based on my comment.

email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the 
following:

Invoice   Date     Purchase Order  Due Date  Balance
8754321   8/17/17  7200016508      9/16/18   140.72
5245344   11/7/17  4500199620      12/7/18   301.54

We would appreciate quick payment of these invoices.'''

index = -1
# Get first line of table, print line and index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')

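This uses the heuristic that the columns in the header row are always title case or upper case (ID). It will fail if, say, a header is exactly 'Account no.' rather than 'Account No.'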

# Get the number that starts at the header's column index on each line
for line in email.split('\n'):
    words = line[index:].split()
    if not words:
        continue
    try:
        print(int(words[0]))
    except ValueError:
        # Not a number (header row or prose), so skip this line
        continue

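The reliability here depends on the data. In my code, the invoice column must be the first column in the table header. That is, 'Invoice Date' cannot come before 'Invoice'. Obviously this needs fixing.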
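
One possible way to relax that limitation, sketched under the assumption that table columns are separated by at least two spaces and that no cell contains a double space: split each line into cells, find the header cell that is exactly 'Invoice' or 'Invoice#', and read the same cell position from the rows that follow. The names `HEADER_NAMES` and `extract_invoice_numbers` and the `sample` text are hypothetical, introduced only for this illustration.

```python
import re

# Hypothetical set of header labels treated as the invoice column (lower-cased).
HEADER_NAMES = {"invoice", "invoice#"}


def extract_invoice_numbers(text):
    """Return the values of the invoice column as a list of strings."""
    col = None      # position of the invoice column once the header is found
    numbers = []
    for line in text.splitlines():
        # Assumption: columns are separated by runs of 2+ whitespace characters.
        cells = re.split(r"\s{2,}", line.strip())
        if col is None:
            # Still looking for the header row.
            for i, cell in enumerate(cells):
                if cell.lower() in HEADER_NAMES:
                    col = i
                    break
            continue
        if col < len(cells) and cells[col].isdigit():
            numbers.append(cells[col])
        elif numbers:
            # First non-matching line after the data rows: the table has ended.
            break
    return numbers


# Made-up sample in which 'Invoice Date' comes before 'Invoice#'.
sample = """Hi - please confirm the status of the two invoices below.

Invoice Date  Invoice#  Due Date
29-Jan-19     7651234   28-Apr-19
27-Feb-19     9872341   26-Apr-19

Thanks."""

print(extract_invoice_numbers(sample))  # ['7651234', '9872341']
```

Because the header cell is compared as a whole, 'Invoice Date' no longer shadows 'Invoice#', and the number length is not fixed. The sketch would still break on tables whose columns are separated by single spaces or tabs of varying width.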
