Data extraction from a string

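I have extracted tabular information from a scanned image (see the screenshot above) and obtained the output as a string.

This is the output: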

Payment Date: 9/14/2020 Reference: 0000232954                                                                
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86                                                             
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96                                                            
Vendor Number Name Bank Charge Transfer Cost Cd                                                              
1000028351 OFFICE DEPOT INC $0.00
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
                                                                   
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82

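The output above is a string in which the data is not aligned. I am looking for a solution that can store this data in a DataFrame or an Excel sheet.

The column names in the image above can be read as [Invoice Number, Invoice Date, Voucher ID, Gross Amount, Discounts, Late Charges, Paid Amount].

Looking forward to your help!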

You can use regular expressions to parse the text. For example:

import re
import pandas as pd

txt = """
Payment Date: 9/14/2020 Reference: 0000232954                                                                
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86                                                             
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96                                                            
Vendor Number Name Bank Charge Transfer Cost Cd                                                              
1000028351 OFFICE DEPOT INC $0.00
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
                                                                   
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
"""

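# Single values from the first line of the document
pat_payment_date = re.compile(r"Payment Date:\s*(\S+)")
pat_reference = re.compile(r"Reference:\s*(\S+)")

# Invoice rows: number, date, voucher ID, then four amounts.
# re.M lets ^ match at the start of every line.
pat_items = re.compile(
    r"^(\d+)\s+(\S+)\s+(\d+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)",
    flags=re.M,
)
# The line following the "Vendor ..." header: number, name, bank charge
pat_vendor = re.compile(
    r"^Vendor.*?\n^(\d+)\s+(.*?)\s+([$\d,.]+)", flags=re.M | re.S
)
# The first data line following the "Reference Date ..." header: the totals
pat_last = re.compile(
    r"^Reference.*?\n^(\d+)\s+(\S+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)",
    flags=re.M | re.S,
)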

data = {}
for row in pat_payment_date.findall(txt):
    data["Payment Date"] = row
for row in pat_reference.findall(txt):
    data["Reference"] = row
for row in pat_items.findall(txt):
    data.setdefault("Items", []).append(list(row))
for row in pat_vendor.findall(txt):
    data["Vendor"] = list(row)
for row in pat_last.findall(txt):
    data["Total"] = list(row)

df = pd.DataFrame([data]).explode("Items")
print(df)

Prints:

  Payment Date   Reference                                                              Items                                 Vendor                                                          Total
0    9/14/2020  0000232954   [102554463001, Jul/062020, 21002450, 699.86, 0.00, 0.00, 699.86]  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
0    9/14/2020  0000232954  [112942431001, Aug/12/2020, 21002565, 875.96, 0.00, 0,00, 875.96]  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
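
If only the invoice rows are needed, the column names can also be attached at match time with named groups. The following is a minimal sketch of that idea rather than part of the approach above: the group names are taken from the column list in the question, `parse_items` is a hypothetical helper, and `sample` is a shortened copy of the OCR text.

```python
import re

# Same shape as the invoice-row pattern, but every group carries a name.
# The names follow the column list given in the question.
ITEM_RE = re.compile(
    r"^(?P<invoice_number>\d+)\s+"
    r"(?P<invoice_date>\S+)\s+"
    r"(?P<voucher_id>\d+)\s+"
    r"(?P<gross_amount>[\d,.]+)\s+"
    r"(?P<discounts>[\d,.]+)\s+"
    r"(?P<late_charges>[\d,.]+)\s+"
    r"(?P<paid_amount>[\d,.]+)",
    flags=re.M,
)


def parse_items(text):
    """Return one dict per invoice row found in the text (hypothetical helper)."""
    return [m.groupdict() for m in ITEM_RE.finditer(text)]


# Shortened sample: header, two invoice rows, and the vendor row.
sample = """\
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96
1000028351 OFFICE DEPOT INC $0.00
"""

records = parse_items(sample)
for rec in records:
    print(rec)
```

The header and vendor lines are skipped because they do not fit the digits, token, digits, amounts shape. Passing `records` to `pd.DataFrame(records)` should then give one column per group name.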

To create columns from the list, you can do:

cols = [
    "invoice_number",
    "invoice_date",
    "voucher_id",
    "gross_amount",
    "discounts",
    "late_charges",
    "paid_amount",
]
df = pd.concat(
    [
        df,
        df.pop("Items")
        .apply(lambda x: {c: v for v, c in zip(x, cols)})
        .apply(pd.Series),
    ],
    axis=1,
)
print(df)

Prints:

  Payment Date   Reference                                 Vendor                                                          Total invoice_number invoice_date voucher_id gross_amount discounts late_charges paid_amount
0    9/14/2020  0000232954  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]   102554463001   Jul/062020   21002450       699.86      0.00         0.00      699.86
0    9/14/2020  0000232954  [1000028351, OFFICE DEPOT INC, $0.00]  [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]   112942431001  Aug/12/2020   21002565       875.96      0.00         0,00      875.96
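
The values in the resulting columns are still strings and still carry OCR noise such as `0,00` and `Jul/062020`. Below is a rough sketch of how they might be normalised before saving. Both helpers, `parse_amount` and `parse_date`, are hypothetical, and the rules they apply are assumptions about this particular document (a trailing comma plus two digits is a misread decimal point; dates are always Mon/DD/YYYY with possibly missing slashes).

```python
import re
from datetime import date

# English month abbreviations as they appear in the OCR output
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_amount(raw):
    """Convert an OCR'd amount such as '$1,575.82' or '0,00' to a float.

    Assumption: a comma followed by exactly two digits at the end is a
    misread decimal point; any other comma is a thousands separator.
    """
    s = raw.strip().lstrip("$")
    s = re.sub(r",(\d{2})$", r".\1", s)  # '0,00' -> '0.00'
    s = s.replace(",", "")               # '1,575.82' -> '1575.82'
    return float(s)


def parse_date(raw):
    """Convert 'Aug/12/2020' or the damaged 'Jul/062020' to a date.

    Assumption: the format is always Mon/DD/YYYY and only slashes get lost.
    """
    m = re.fullmatch(r"([A-Za-z]{3})/?(\d{2})/?(\d{4})", raw.strip())
    if m is None:
        raise ValueError(f"unrecognised date: {raw!r}")
    month, day, year = m.groups()
    return date(int(year), MONTHS[month.title()], int(day))


print(parse_amount("$1,575.82"), parse_amount("0,00"))
print(parse_date("Jul/062020"), parse_date("Aug/12/2020"))
```

These could be applied per column, for example `df["gross_amount"].map(parse_amount)`. Once the columns are clean, `df.to_excel("invoices.xlsx", index=False)` writes the frame to an Excel sheet; the file name is only a placeholder, and pandas needs an Excel writer such as openpyxl installed for this.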