How to scrape data from PDF into Excel
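I am trying to scrape data from a PDF and save it to an Excel file. This is the PDF I need: https://www.medicaljournals.se/acta/content_files/files/pdf/98/219/Suppl219.pdf
However, I don't need all of the data, only the following (image below), saved to Excel in separate cells:
Starting on page 5, from P001 up to and including the Introduction: each entry has a P number, a title, people's names and an introduction.
At the moment I can only convert the PDF to text (my code is below) and save it all in one cell, but I need to split it into separate cells.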
import PyPDF2 as p2

# Open the PDF in binary mode
PDFfile = open('Abstract Book from the 5th World Psoriasis and Psoriatic Arthritis Conference 2018.pdf', 'rb')
pdfread = p2.PdfFileReader(PDFfile)
pdflist = []
i = 6
while i < pdfread.getNumPages():
    pageinfo = pdfread.getPage(i)
    # print(pageinfo.extractText())
    i = i + 1
    # Store each page's text with line breaks removed
    pdflist.append(pageinfo.extractText().replace('\n', ''))
print(pdflist)
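What you mainly need is a 'header' regular expression that matches a run of at least 15 uppercase characters, and an 'article' regular expression that matches the letter 'P' followed by 3 digits.
One more regular expression helps you split the text on any set of keywords.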
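To see how these patterns behave before running them on the full PDF text, here is a minimal sketch on a made-up string. It is a variation, not the exact code used in this answer: it wraps the patterns in capturing groups so that `re.split` keeps the P number and the section labels instead of discarding them. The names `section_re`, `split_articles`, `split_sections` and the sample text are hypothetical, and the uppercase header handling is left out for brevity.

```python
import re

# Capturing group: re.split keeps the matched 'P001' in its result
article_re = re.compile(r'(P\d{3})')
# Capturing group around the keyword so each section keeps its own label
section_re = re.compile(r'(Introduction|Objectives|Methods|Results|Conclusions|References):')


def split_articles(text):
    """Return (number, body) pairs; text before the first P number is dropped."""
    parts = article_re.split(text)
    # parts == [preamble, 'P001', body1, 'P002', body2, ...]
    return list(zip(parts[1::2], parts[2::2]))


def split_sections(body):
    """Return a dict keyed by the labels actually found in the body."""
    parts = section_re.split(body)
    # parts == [text before first keyword, label1, text1, label2, text2, ...]
    sections = {'Peoples': parts[0].strip()}
    for label, value in zip(parts[1::2], parts[2::2]):
        sections[label] = value.strip()
    return sections


# Made-up sample, not taken from the PDF
sample = ("Poster abstracts P001 J. Doe Introduction: Background. Results: Findings. "
          "P002 A. Smith Introduction: More background.")

for number, body in split_articles(sample):
    print(number, split_sections(body))
```

Because each value is keyed by the label found in the text, an abstract that lacks one of the sections does not shift the remaining values into the wrong columns, which can happen when keys and values are paired purely by position.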
import re
import PyPDF2

article_re = re.compile(r'[P]\d{3}')  # P001: letter 'P' and 3 digits
# At least 15 characters drawn from uppercase letters, whitespace (including '\n') and '-'.
# The '|$' alternative gives re.split an empty match at the end of the text,
# so on Python 3.7+ the split below always has at least two parts.
header_re = re.compile(r'[A-Z\s\-]{15,}|$')
key_word_delimeters = ['Peoples', 'Introduction', 'Objectives', 'Methods', 'Results', 'Conclusions', 'References']

file = open('data.pdf', 'rb')
pdf = PyPDF2.PdfFileReader(file)  # PyPDF2 < 3.0 API
text = ''
for i in range(6, 63):
    text += pdf.getPage(i).extractText()  # all text in one variable

articles = []
for article in re.split(article_re, text):
    header = re.match(header_re, article)  # receiving a match
    if header and header.group():  # skip chunks that do not start with a header
        other_text = re.split(header_re, article)[1]  # receiving other text
        header = header.group()  # get text from match
        # Save the first letter of the name to put it back in the right position.
        # Some kind of HOT BUGFIX: the uppercase run swallows the name's initial.
        first_name_letter = header[-1]
        header = header[:-1]  # cut last character: the first letter of the name
        header = header.replace('\n', '')  # delete line breaks
        header = header.replace('-', '')  # delete line break symbol
        item = {'header': header}  # store the cleaned header
        other_text = first_name_letter + other_text
        data_array = re.split(
            'Introduction:|Objectives:|Methods:|Results:|Conclusions:|References:',
            other_text)
        for key, data in zip(key_word_delimeters, data_array):
            item[key] = data.replace('\n', '')
        articles.append(item)
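The code above stops at the `articles` list, while the question asks for each field in its own cell. One possible last step, sketched here with only the standard library, is to write one row per article to a CSV file, which Excel opens with each field in a separate column. The function name `write_articles`, the column selection, the output path and the sample rows are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical column choice: the fields the question asks for
COLUMNS = ['header', 'Peoples', 'Introduction']


def write_articles(articles, path, columns=COLUMNS):
    """Write one row per article and one column per field."""
    # 'utf-8-sig' adds a byte order mark, which helps Excel detect UTF-8
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(
            f,
            fieldnames=columns,
            restval='',             # fields missing from an article become empty cells
            extrasaction='ignore',  # fields outside `columns` are skipped
        )
        writer.writeheader()
        writer.writerows(articles)


# Made-up rows shaped like the dicts in the `articles` list
sample = [
    {'header': 'SAMPLE TITLE', 'Peoples': 'J. Doe', 'Introduction': 'Text.', 'Methods': 'Skipped.'},
    {'header': 'ANOTHER TITLE', 'Peoples': 'A. Smith'},
]

out_path = os.path.join(tempfile.gettempdir(), 'articles.csv')
write_articles(sample, out_path)
```

If a native .xlsx file is preferred and both pandas and openpyxl are installed, `pandas.DataFrame(articles).to_excel('articles.xlsx', index=False)` would give a similar result.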