PDF跨多行解析一个句子
PDF Parsing a sentence across multiple Lines
目标:如果 pdf 行包含子字符串,则复制整个句子(跨多行)。
我能够print()
line
phrase
出现在
现在,一旦我找到这个line
,我想返回迭代,直到我找到一个句子结束符:. ! ?
,从上一个句子开始,再次向前迭代直到下一个句子终结者。
这是我所能做到的 print()
该短语所属的整个句子。
Jupyter 笔记本:
# pip install PyPDF2
# pip install pdfplumber
# ---
# import re
import glob
import PyPDF2
import pdfplumber
# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.pattern('^[A-Z][^?!.]*[?.!]$')
def scrape_sentence(sentence, lines, index):
if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
return sentence.replace('\n', '').strip()
sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1) # previous line
sentence = scrape_sentence(sentence + lines[index+1], lines, index+1) #
following line
return sentence
# ---
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
for page in opened_pdf.pages:
text = page.extract_text()
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
sentence = scrape_sentence('', lines, i) # !
print(sentence) # !
i += 1
输出:
connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there
短语:
Responsible Care Company
句子(跨多行):
"GPIC is a Responsible Care Company certified for RC 14001
since July 2010."
我一直在研究 “回溯” 迭代,基于此 solution。我确实尝试了 for-loop
,但它不会让你返回迭代。
如果我还有什么可以补充的,请告诉我post。
您收到的错误是由于您的代码试图修改 None.
类型的对象造成的
要解决此问题,有两个选项,第一个是在 if 语句中包围拆分操作
for page in opened_pdf.pages:
text = page.extract_text()
if text != None:
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
sentence = scrape_sentence('', lines, i)
print(sentence)
i += 1
或者您可以使用 continue statement 跳过循环的其余部分:
for page in opened_pdf.pages:
text = page.extract_text()
if text == None:
continue
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
sentence = scrape_sentence('', lines, i) # !
print(sentence) # !
i += 1
我有一个工作版本。但是,这 不 说明来自 .pdf
页面的多列文本。
有关相关讨论,请参阅 here。
示例.pdf
Jupyter 笔记本:
# pip install PyPDF2
# pip install pdfplumber
# ---
import glob
import PyPDF2
import pdfplumber
# ---
def scrape_sentence(phrase, lines, index):
# -- Gather sentence 'phrase' occurs in --
sentence = lines[index]
print("-- sentence --", sentence)
print("len(lines)", len(lines))
# Previous lines
pre_i, flag = index, 0
while flag == 0:
pre_i -= 1
if pre_i <= 0:
break
sentence = lines[pre_i] + sentence
if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or ' • ' in lines[pre_i]:
flag == 1
print("\n", sentence)
# Following lines
post_i, flag = index, 0
while flag == 0:
post_i += 1
if post_i >= len(lines):
break
sentence = sentence + lines[post_i]
if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or ' • ' in lines[pre_i]:
flag == 1
print("\n", sentence)
# -- Extract --
sentence = sentence.replace('!', '.')
sentence = sentence.replace('?', '.')
sentence = sentence.split('.')
sentence = [s for s in sentence if phrase in s]
print(sentence)
sentence = sentence[0].replace('\n', '').strip() # first occurance
print(sentence)
return sentence
# ---
phrase = 'Global Reporting Initiative'
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
for page in opened_pdf.pages:
text = page.extract_text()
if text == None:
continue
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if phrase in lines[i]:
sentence = scrape_sentence(phrase, lines, i)
i += 1
输出:
-- sentence -- 2016 Global Reporting Initiative (GRI) Report
len(lines) 7
2016 Global Reporting Initiative (GRI) Report
2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
['2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01']
2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
...
目标:如果 pdf 行包含子字符串,则复制整个句子(跨多行)。
我能够print()
line
phrase
出现在
现在,一旦我找到这个line
,我想返回迭代,直到我找到一个句子结束符:. ! ?
,从上一个句子开始,再次向前迭代直到下一个句子终结者。
这是我所能做到的 print()
该短语所属的整个句子。
Jupyter 笔记本:
# pip install PyPDF2
# pip install pdfplumber
# ---
# import re
import glob
import PyPDF2
import pdfplumber
# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.pattern('^[A-Z][^?!.]*[?.!]$')
def scrape_sentence(sentence, lines, index):
if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
return sentence.replace('\n', '').strip()
sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1) # previous line
sentence = scrape_sentence(sentence + lines[index+1], lines, index+1) #
following line
return sentence
# ---
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
for page in opened_pdf.pages:
text = page.extract_text()
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
sentence = scrape_sentence('', lines, i) # !
print(sentence) # !
i += 1
输出:
connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there
短语:
Responsible Care Company
句子(跨多行):
"GPIC is a Responsible Care Company certified for RC 14001
since July 2010."
我一直在研究 “回溯” 迭代,基于此 solution。我确实尝试了 for-loop
,但它不会让你返回迭代。
如果我还有什么可以补充的,请告诉我post。
您收到的错误是由于您的代码试图修改 None.
类型的对象造成的要解决此问题,有两个选项,第一个是在 if 语句中包围拆分操作
for page in opened_pdf.pages:
text = page.extract_text()
if text != None:
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
sentence = scrape_sentence('', lines, i)
print(sentence)
i += 1
或者您可以使用 continue statement 跳过循环的其余部分:
for page in opened_pdf.pages:
text = page.extract_text()
if text == None:
continue
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
sentence = scrape_sentence('', lines, i) # !
print(sentence) # !
i += 1
我有一个工作版本。但是,这 不 说明来自 .pdf
页面的多列文本。
有关相关讨论,请参阅 here。
示例.pdf
Jupyter 笔记本:
# pip install PyPDF2
# pip install pdfplumber
# ---
import glob
import PyPDF2
import pdfplumber
# ---
def scrape_sentence(phrase, lines, index):
# -- Gather sentence 'phrase' occurs in --
sentence = lines[index]
print("-- sentence --", sentence)
print("len(lines)", len(lines))
# Previous lines
pre_i, flag = index, 0
while flag == 0:
pre_i -= 1
if pre_i <= 0:
break
sentence = lines[pre_i] + sentence
if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or ' • ' in lines[pre_i]:
flag == 1
print("\n", sentence)
# Following lines
post_i, flag = index, 0
while flag == 0:
post_i += 1
if post_i >= len(lines):
break
sentence = sentence + lines[post_i]
if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or ' • ' in lines[pre_i]:
flag == 1
print("\n", sentence)
# -- Extract --
sentence = sentence.replace('!', '.')
sentence = sentence.replace('?', '.')
sentence = sentence.split('.')
sentence = [s for s in sentence if phrase in s]
print(sentence)
sentence = sentence[0].replace('\n', '').strip() # first occurance
print(sentence)
return sentence
# ---
phrase = 'Global Reporting Initiative'
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
for page in opened_pdf.pages:
text = page.extract_text()
if text == None:
continue
lines = text.split('\n')
i = 0
sentence = ''
while i < len(lines):
if phrase in lines[i]:
sentence = scrape_sentence(phrase, lines, i)
i += 1
输出:
-- sentence -- 2016 Global Reporting Initiative (GRI) Report
len(lines) 7
2016 Global Reporting Initiative (GRI) Report
2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
['2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01']
2016 Global Reporting Initiative (GRI) ReportIncluding: UN Global Compact - Communication on ProgressUN Global Compact - Food and Agriculture Business PrinciplesUN Global Compact - Women’s Empowerment Principlesgulf petrochemical industries companyii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
...