Data extraction from a string
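I have extracted tabular information from a scanned image (see the screenshot above) and obtained the output as a string.
This is the output: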
Payment Date: 9/14/2020 Reference: 0000232954
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96
Vendor Number Name Bank Charge Transfer Cost Cd
1000028351 OFFICE DEPOT INC $0.00
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
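The output above is a string in which the data is not aligned. I am looking for a solution that stores the data in a DataFrame or an Excel sheet.
The column names in the image above can be read as [Invoice Number, Invoice Date, Voucher ID, Gross Amount, Discounts, Late Charges, Paid Amount].
Looking forward to your help!
You can use regular expressions to parse the text. For example: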
import re

import pandas as pd

# OCR output, kept verbatim (including its recognition errors).
txt = """
Payment Date: 9/14/2020 Reference: 0000232954
Invoice Number Invoice Date Voucher [ID Gross Amount Discounts Late Charges Paid Amount
102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96
Vendor Number Name Bank Charge Transfer Cost Cd
1000028351 OFFICE DEPOT INC $0.00
Reference Date Total Gross Amt Total Discounts Totul Late Charges Total Paid Amt
0000232954 Sep/14/2020 $1,575.82 $0.00 $0.00 $1,575.82
"""

# Header fields on the first line.
pat_payment_date = re.compile(r"Payment Date:\s*(\S+)")
pat_reference = re.compile(r"Reference:\s*(\S+)")

# Invoice rows: number, date, voucher ID, then four amounts.
pat_items = re.compile(
    r"^(\d+)\s+(\S+)\s+(\d+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)\s+([\d,.]+)",
    flags=re.M,
)

# The row below the "Vendor ..." header: number, name, bank charge.
pat_vendor = re.compile(
    r"^Vendor.*?\n^(\d+)\s+(.*?)\s+([$\d,.]+)", flags=re.M | re.S
)

# The row below the "Reference Date ..." header: reference, date, four totals.
pat_last = re.compile(
    r"^Reference.*?\n^(\d+)\s+(\S+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)\s+([$\d,.]+)",
    flags=re.M | re.S,
)

data = {}
for row in pat_payment_date.findall(txt):
    data["Payment Date"] = row
for row in pat_reference.findall(txt):
    data["Reference"] = row
for row in pat_items.findall(txt):
    data.setdefault("Items", []).append(list(row))
for row in pat_vendor.findall(txt):
    data["Vendor"] = list(row)
for row in pat_last.findall(txt):
    data["Total"] = list(row)

# One row per invoice item; the remaining columns are repeated on each row.
df = pd.DataFrame([data]).explode("Items")
print(df)
Prints:
Payment Date Reference Items Vendor Total
0 9/14/2020 0000232954 [102554463001, Jul/062020, 21002450, 699.86, 0.00, 0.00, 699.86] [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
0 9/14/2020 0000232954 [112942431001, Aug/12/2020, 21002565, 875.96, 0.00, 0,00, 875.96] [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82]
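As a side note on why the item pattern only picks up the invoice rows: with `re.M`, `^` anchors at the start of every line, and `findall` returns one tuple per match when the pattern has several groups. The following is a minimal sketch on a shortened, made-up sample (the `sample` string and the three-group pattern are simplifications for illustration, not the pattern used above):

```python
import re

# Shortened sample for illustration only; not the full OCR output.
sample = """Invoice Number Invoice Date Voucher ID
102554463001 Jul/062020 21002450
112942431001 Aug/12/2020 21002565
Reference: 0000232954
"""

# Three groups: invoice number, date token, voucher ID.
pattern = r"^(\d+)\s+(\S+)\s+(\d+)$"

# With re.M, ^ and $ match at every line boundary, so each data row matches.
rows_multiline = re.findall(pattern, sample, flags=re.M)

# Without re.M, ^ matches only at the very start of the string ("Invoice ..."),
# so no row is found.
rows_default = re.findall(pattern, sample)

print(rows_multiline)
print(rows_default)
```

Header lines start with a letter, so `^(\d+)` rejects them without any extra filtering.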
To create columns from the list, you can do this:
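cols = [
    "invoice_number",
    "invoice_date",
    "voucher_id",
    "gross_amount",
    "discounts",
    "late_charges",
    "paid_amount",
]

# Turn each list in "Items" into a dict keyed by column name, expand the
# dicts into columns, and attach them to the remaining columns of df.
df = pd.concat(
    [
        df,
        df.pop("Items")
        .apply(lambda x: {c: v for v, c in zip(x, cols)})
        .apply(pd.Series),
    ],
    axis=1,
)
print(df)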
Prints:
Payment Date Reference Vendor Total invoice_number invoice_date voucher_id gross_amount discounts late_charges paid_amount
0 9/14/2020 0000232954 [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82] 102554463001 Jul/062020 21002450 699.86 0.00 0.00 699.86
0 9/14/2020 0000232954 [1000028351, OFFICE DEPOT INC, $0.00] [0000232954, Sep/14/2020, $1,575.82, $0.00, $0.00, $1,575.82] 112942431001 Aug/12/2020 21002565 875.96 0.00 0,00 875.96
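All values in the frame are still strings, and some carry OCR errors (`0,00` instead of `0.00`, `Jul/062020` with a missing slash). If you need real numbers and dates before exporting, something along these lines might help. This is only a sketch: `parse_amount` and `parse_date` are hypothetical helper names, and the rules they apply (a comma followed by exactly two trailing digits is a misread decimal point; a date is a three-letter English month, a day, and a four-digit year with optional slashes) are assumptions based on the sample above, not general OCR cleanup.

```python
import re
from datetime import date
from decimal import Decimal

# English month abbreviations, as they appear in the sample output.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_amount(raw: str) -> Decimal:
    """Convert strings such as '$1,575.82' or '0,00' to Decimal."""
    s = raw.strip().lstrip("$")
    # Assumption: digits, a comma, and exactly two digits (no period)
    # is a decimal point that OCR misread as a comma, e.g. '0,00'.
    if re.fullmatch(r"\d+,\d{2}", s):
        s = s.replace(",", ".")
    else:
        # Otherwise commas are treated as thousands separators.
        s = s.replace(",", "")
    return Decimal(s)


def parse_date(raw: str) -> date:
    """Parse 'Aug/12/2020', tolerating missing slashes as in 'Jul/062020'."""
    m = re.fullmatch(r"([A-Za-z]{3})/?(\d{1,2})/?(\d{4})", raw.strip())
    if m is None:
        raise ValueError(f"unrecognized date: {raw!r}")
    month, day, year = m.groups()
    return date(int(year), MONTHS[month.title()], int(day))


print(parse_amount("$1,575.82"), parse_amount("0,00"))
print(parse_date("Jul/062020"), parse_date("Aug/12/2020"))
```

Applied to the frame, `df["gross_amount"].map(parse_amount)` would convert one column, and `df.to_excel("invoices.xlsx", index=False)` would then write the sheet (the file name is a placeholder, and pandas needs an Excel engine such as openpyxl installed for this).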