Is it possible to split the content of a PDF file with line breaks in it?
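I have a PDF file that I want to extract data from. Currently, I split the text by line and store it in a list. I would like to know whether it is possible to split it on the bold separator lines instead and store the result in a list.
The bold line is the separator between blocks, so extracting data from this file would be easy if that were possible.
The output I want looks like this: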
['DISTRICT ROW LLC', 'Premises No.: 0', 'License Key: 0', 'Date Entered:09/08/2021', 'Tradename: OLSEN RUN WINERY', 'Email Address: rachel@olsenrun.com', 'License Type/Action: F-COM/ N/O', ]
Current code:
import re
import pdftotext
import csv

with open('data.csv', 'a+', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_headers = [
        'Name', 'Date Entered', 'Tradename', 'Address', 'Email Address', 'License Type/Action'
    ]
    csv_writer.writerow(csv_headers)

    with open('pdf_file.pdf', 'rb') as pdf_file:
        pdf = pdftotext.PDF(pdf_file)
        lines = pdf[0][313:].split('\n')
        new_list = []
        count = 1
        for line in lines:
            new_list.append(line)
            count += 1
            if count % 8 == 0:
                name = ''
                date_entered = re.findall(r'Date Entered: (.[0-9]\/.[0-9]\/.[0-9].[0-9]?)', "".join(new_list))
                trade_name = re.findall(r'Tradename: (.*[A-Z]?)(?<= ).*', "".join(new_list))
                address = re.findall(r'Tradename: (.*[A-Z]?)(?<= ).*', "".join(new_list))
                email = ''
                license_type_or_action = ''
                new_list.clear()
Some output:
['DISTRICT ROW LLC\r', ' Premises No.: 0 License Key: 0 Date Entered: 09/08/2021\r', ' Tradename: OLSEN RUN WINERY Date Received: 09/03/2021\r', ' Address: 32900 DIAMOND HILL DR, HARRISBURG 97446\r', ' Email Address: rachel@olsenrun.com\r', 'License Type/Action: F-COM / N/O\r', '\r']
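I tried to open your PDF with PyPDF4 but ran into an annoying problem, PdfReadWarning: Superfluous whitespace found in object header [...], which is still not solved in PyPDF, probably because of the poor quality of the file. So I used pdftotext from the shell to convert the file to text.
I tried to find the start index of each block with a regex criterion. That automatically gives you the end as well: it is the start index of the next block (minus 1).
Once you have the start and end indices, the corresponding slice is one block.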
import re

path_pdf = 'pdf_file.txt'  # placeholder: path to the text file written by pdftotext

with open(path_pdf, 'r') as fd:
    text = fd.read()

header = """Report Date: 9/14/2021
Oregon Liquor & Cannabis Commission
Page {} of 5
Weekly Applications Received
For Entry Dates: 09/04/2021 Through 09/10/2021"""

# globally remove header - the header depends on the page number
text_header_less = text
for i in range(1, 6):
    text_header_less = text_header_less.replace(header.format(i), '')

text_header_less_lines = text_header_less.split('\n')

company_name_pattern = re.compile(r'^([A-Z]{3,})')  # at least 3 consecutive capital letters...could be better:)

# indices of the lines that look like a company name, i.e. the start of a block
start_location_company_datas = []
for i, l in enumerate(text_header_less_lines):
    if company_name_pattern.search(l) is not None:
        start_location_company_datas += [i]

# pair each start with the next start; the last block has no "next" and is not emitted
company_data = []
for start, end in zip(start_location_company_datas[:-1], start_location_company_datas[1:]):
    # ! contain still empty lines - to be cleaned?
    #company_data += ['\n'.join(text_header_less_lines[start-1: end])] # as a string
    company_data += [[text_header_less_lines[start-1: end] ]] # as a list

for i in company_data[:2]:
    print(i)
    print('-'*20)
Output:
[['', 'DISTRICT ROW LLC', 'Premises No.: 0', '', 'License Key:', '', '0', '', 'Tradename: OLSEN RUN WINERY', 'Address: 32900 DIAMOND HILL DR, HARRISBURG 97446', 'Email Address: rachel@olsenrun.com', 'License Type/Action: F-COM / N/O', '', 'Date Entered: 09/08/2021', 'Date Received: 09/03/2021', ''], ...]
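The loop above pairs each start index with the following one, so the last block in the file has no end index and is not emitted. As a minimal sketch (not part of the original answer), the same idea can be written with a sentinel end index so the final block is kept too. The function name split_blocks and the second company in the sample data are made up for illustration; the uppercase heuristic is the same as above.

```python
import re

# Same heuristic as above: a block starts on a line that begins
# with at least 3 consecutive capital letters.
COMPANY_NAME = re.compile(r'^[A-Z]{3,}')


def split_blocks(lines):
    """Group lines into blocks, each starting at a company-name line."""
    starts = [i for i, line in enumerate(lines) if COMPANY_NAME.search(line)]
    # Append len(lines) as a sentinel so the last start also gets an end.
    bounds = starts + [len(lines)]
    return [lines[start:end] for start, end in zip(bounds[:-1], bounds[1:])]


# Hypothetical sample: 'ACME WINES INC' is invented for this example.
lines = ['', 'DISTRICT ROW LLC', 'Premises No.: 0', '',
         'ACME WINES INC', 'Premises No.: 1']
blocks = split_blocks(lines)
for block in blocks:
    print(block)
```

Lines before the first company name are dropped, and an input without any matching line yields an empty list.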
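Remarks:
- The final data can be cleaned by removing empty lines and extra spaces... it depends on what you are looking for.
- The blocks cannot be grouped on the separator automatically, because the separator line is not visible in the extracted text.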
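For the cleaning step, one possible sketch is shown here. It strips whitespace, drops empty lines, and re-attaches a value that pdftotext placed on its own line after a bare label such as 'License Key:'. The helper name clean_block is hypothetical, and the re-attachment rule assumes a label never has an empty value; if it can, the next line would be glued to it by mistake.

```python
def clean_block(lines):
    """Return the non-empty, stripped lines of one block, joining a
    bare 'Label:' line with the value found on the following line."""
    stripped = [line.strip() for line in lines if line.strip()]
    cleaned = []
    for item in stripped:
        # An entry ending in ':' is a label still waiting for its value.
        if cleaned and cleaned[-1].endswith(':'):
            cleaned[-1] = f'{cleaned[-1]} {item}'
        else:
            cleaned.append(item)
    return cleaned


# First block from the output shown above.
block = ['', 'DISTRICT ROW LLC', 'Premises No.: 0', '', 'License Key:', '',
         '0', '', 'Tradename: OLSEN RUN WINERY',
         'Address: 32900 DIAMOND HILL DR, HARRISBURG 97446',
         'Email Address: rachel@olsenrun.com',
         'License Type/Action: F-COM / N/O', '',
         'Date Entered: 09/08/2021', 'Date Received: 09/03/2021', '']
cleaned = clean_block(block)
print(cleaned)
```

The result is a flat list of 'Label: value' strings close to the list asked for in the question, with the fields in the order pdftotext produced them.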