Is it possible to split the content of a PDF file with line breaks in it?

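I have a PDF file that I want to extract data from. At the moment I split the text line by line and store the lines in a list. I would like to know whether I can somehow split it at the bold line breaks instead and store the resulting blocks in a list.

The bold line is the separator between blocks, so if that is possible, extracting data from this file would be easy.

The output I want looks like this: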

['DISTRICT ROW LLC', 'Premises No.: 0', 'License Key: 0', 'Date Entered:09/08/2021', 'Tradename: OLSEN RUN WINERY', 'Email Address: rachel@olsenrun.com', 'License Type/Action:  F-COM/ N/O', ]

Current code:

import re
import pdftotext
import csv


with open('data.csv', 'a+', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_headers = [
        'Name', 'Date Entered', 'Tradename', 'Address', 'Email Address', 'License Type/Action' 
    ]
    csv_writer.writerow(csv_headers)
        
    with open('pdf_file.pdf', 'rb') as pdf_file:
        pdf = pdftotext.PDF(pdf_file)
    
    lines = pdf[0][313:].split('\n')
    new_list = []
    count = 1
    for line in lines:
        new_list.append(line)
        count += 1
        if count % 8 == 0:
            name = ''
            # the lines are joined without a separator, so each value is bounded explicitly
            date_entered = re.findall(r'Date Entered: (\d{2}/\d{2}/\d{4})', "".join(new_list))
            trade_name = re.findall(r'Tradename: (.*?)\s{2,}', "".join(new_list))
            address = re.findall(r'Address: ([^\r\n]*)', "".join(new_list))
            email = ''
            license_type_or_action = ''
            
            new_list.clear()

Some of the output:

['DISTRICT ROW LLC\r', '   Premises No.: 0           License Key:   0                                  Date Entered: 09/08/2021\r', '    Tradename: OLSEN RUN WINERY                                               Date Received: 09/03/2021\r', '        Address: 32900 DIAMOND HILL DR, HARRISBURG 97446\r', '  Email Address: rachel@olsenrun.com\r', 'License Type/Action: F-COM / N/O\r', '\r']

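I tried opening the PDF with PyPDF4 and ran into an annoying problem, PdfReadWarning: Superfluous whitespace found in object header [...], which is still not solved in PyPDF and is probably due to the poor quality of the file. So I used pdftotext instead.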

Convert the PDF to text from the shell.
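The conversion step can also be driven from Python. The following is a minimal sketch, assuming the `pdftotext` command-line tool from Poppler is installed and on the PATH; the helper names `pdftotext_command` and `convert` and the output file name `pdf_file.txt` are made up for illustration. The `-layout` option is left off by default because the output further down has one field per line, which suggests the text was extracted without it.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, layout=False):
    """Build the argument list for the pdftotext command-line tool."""
    cmd = ['pdftotext']
    if layout:
        # -layout keeps the physical layout, so several fields share one line
        cmd.append('-layout')
    cmd += [pdf_path, txt_path]
    return cmd


def convert(pdf_path, txt_path, layout=False):
    """Run pdftotext; raises if the executable is missing or the call fails."""
    if shutil.which('pdftotext') is None:
        raise RuntimeError('pdftotext executable not found on PATH')
    subprocess.run(pdftotext_command(pdf_path, txt_path, layout), check=True)


# Show the command that would be run (file names are placeholders)
print(' '.join(pdftotext_command('pdf_file.pdf', 'pdf_file.txt')))
```

Calling `convert('pdf_file.pdf', 'pdf_file.txt')` would then write the text file that the code below reads.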

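I used a regular expression to find the index of the line where each block starts. That automatically gives you the end of each block as well: it is the start index of the next block (minus 1).

Once you have the start and end indices, the corresponding slice of lines is one block.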

import re

path_pdf = '...'  # path to the text file produced by pdftotext

with open(path_pdf, 'r') as fd:
    text = fd.read()

header = """Report Date: 9/14/2021

Oregon Liquor & Cannabis Commission

Page {} of 5

Weekly Applications Received
For Entry Dates: 09/04/2021 Through 09/10/2021"""

# globally remove header - the header depends on the page number
text_header_less = text
for i in range(1, 6):
    text_header_less = text_header_less.replace(header.format(i), '')


text_header_less_lines = text_header_less.split('\n')

company_name_pattern = re.compile(r'^([A-Z]{3,})')  # at least 3 consecutive capital letters...could be better:)
start_location_company_datas = []
for i, l in enumerate(text_header_less_lines):
    if company_name_pattern.search(l) is not None:
        start_location_company_datas += [i]


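company_data = []
# add the end of the text as a final boundary so that the last company is not dropped
boundaries = start_location_company_datas + [len(text_header_less_lines)]
for start, end in zip(boundaries[:-1], boundaries[1:]):
    # ! the blocks still contain empty lines - to be cleaned?
    first = max(start - 1, 0)  # include the line before the company name, without wrapping around
    #company_data += ['\n'.join(text_header_less_lines[first: end])]  # as a string
    company_data += [[text_header_less_lines[first: end] ]]   # as a list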


for i in company_data[:2]:
    print(i)
    print('-'*20)

Output

[['', 'DISTRICT ROW LLC', 'Premises No.: 0', '', 'License Key:', '', '0', '', 'Tradename: OLSEN RUN WINERY', 'Address: 32900 DIAMOND HILL DR, HARRISBURG 97446', 'Email Address: rachel@olsenrun.com', 'License Type/Action: F-COM / N/O', '', 'Date Entered: 09/08/2021', 'Date Received: 09/03/2021', ''], ...]
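To get from a block like this to the columns of a CSV row, the lines can be paired up as "key: value" entries. This is only a sketch under some assumptions: the first non-empty line is the company name, every other field has the form `Key: value`, and a key with nothing after the colon (such as `License Key:`) takes the next non-empty line as its value. The function name `block_to_record` is made up for this example.

```python
def block_to_record(lines):
    """Turn one block of lines into a dict of field name -> value."""
    lines = [ln.strip() for ln in lines if ln.strip()]  # drop empty lines
    record = {'Name': lines[0]} if lines else {}
    pending_key = None  # a key seen without a value on the same line
    for line in lines[1:]:
        key, sep, value = line.partition(':')
        if sep and value.strip():
            record[key.strip()] = value.strip()
            pending_key = None
        elif sep:
            pending_key = key.strip()
        elif pending_key is not None:
            record[pending_key] = line
            pending_key = None
    return record


block = ['', 'DISTRICT ROW LLC', 'Premises No.: 0', '', 'License Key:', '', '0', '',
         'Tradename: OLSEN RUN WINERY',
         'Address: 32900 DIAMOND HILL DR, HARRISBURG 97446',
         'Email Address: rachel@olsenrun.com',
         'License Type/Action: F-COM / N/O', '',
         'Date Entered: 09/08/2021', 'Date Received: 09/03/2021', '']
record = block_to_record(block)
print(record['Name'], '|', record['Tradename'], '|', record['License Key'])
```

Each entry of `company_data` wraps its lines in one extra list, so with the code above the call would be `block_to_record(entry[0])`. The resulting dicts could then be written out with `csv.DictWriter`.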

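Remarks:

  • The final data can be cleaned up by removing empty lines and superfluous whitespace... it depends on what you are looking for
  • The blocks cannot be grouped automatically by the separator lines, because those lines are not visible in the extracted text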
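The cleaning mentioned in the first remark could look roughly like the sketch below. It assumes that fields sharing one line are separated by at least two spaces, as in the layout-preserving output shown in the question, and that a value follows its key after a colon. The names `split_fields` and `clean_lines` are made up for this example.

```python
import re


def split_fields(line):
    """Split one extracted line into its 'Key: value' fields."""
    # strip() removes the trailing '\r' and the indentation;
    # collapsing the whitespace after each colon keeps key and value together
    line = re.sub(r':\s+', ': ', line.strip())
    return [field for field in re.split(r'\s{2,}', line) if field]


def clean_lines(lines):
    """Flatten a block into a list of fields without empty entries."""
    fields = []
    for line in lines:
        fields.extend(split_fields(line))
    return fields


raw = ['DISTRICT ROW LLC\r',
       '   Premises No.: 0           License Key:   0          Date Entered: 09/08/2021\r',
       '    Tradename: OLSEN RUN WINERY              Date Received: 09/03/2021\r',
       '        Address: 32900 DIAMOND HILL DR, HARRISBURG 97446\r',
       '  Email Address: rachel@olsenrun.com\r',
       'License Type/Action: F-COM / N/O\r',
       '\r']
print(clean_lines(raw))
```

Applied to the lines from the question, this gives a flat list close to the desired output, with `Date Received` kept as an extra field.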