Scraping a sentence across many lines | Recursive error unresolved

Goal: if a line of a PDF contains a substring, copy the entire sentence it belongs to (which may span multiple lines).

I am able to print() the line that the phrase appears in.

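Now, once I have found this line, I want to iterate backwards until I reach a sentence terminator (. ! ?) belonging to the previous sentence, then iterate forwards again until the next sentence terminator.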

That way I can print() the entire sentence the phrase belongs to.

However, I have a recursion error: scrape_sentence() gets stuck running infinitely.
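
To make the failure concrete, here is a minimal sketch that reproduces it without a PDF. It keeps only the recursive part of scrape_sentence() from the notebook; toy_lines and recursion_blows_up() are hypothetical names made up for this illustration. Whenever two adjacent lines both lack a terminator, the "following line" call on one of them triggers a "previous line" call on the other, and the two keep calling each other until Python raises RecursionError.

```python
def scrape_sentence(sentence, lines, index, phrase):
    # Same control flow as the notebook's function, trimmed to the recursion.
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence
    # Step back one line, then step forward one line.
    sentence = scrape_sentence(lines[index - 1] + sentence, lines, index - 1, phrase)
    sentence = scrape_sentence(sentence + lines[index + 1], lines, index + 1, phrase)
    return sentence


# Hypothetical input: lines 1 and 2 both lack a sentence terminator.
toy_lines = [
    "End of the previous sentence.",
    "GPIC is a Responsible Care Company",
    "certified for RC 14001",
    "since July 2010.",
]


def recursion_blows_up(lines, index):
    """Return True if scrape_sentence() exhausts the recursion limit."""
    try:
        scrape_sentence('', lines, index, "Responsible Care Company")
    except RecursionError:
        return True
    return False


# Starting on line 1: line 1 recurses forward to line 2, which recurses
# back to line 1, and so on, because neither line satisfies the base case.
print(recursion_blows_up(toy_lines, 1))
```

Two further hazards of the same function are worth noting: lines[index - 1] silently wraps around to the last line when index is 0, and lines[index + 1] raises IndexError on the last line of a page.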


Jupyter Notebook:

# pip install PyPDF2
# pip install pdfplumber

# ---
# import re
import glob
import PyPDF2
import pdfplumber

# ---
phrase = "Responsible Care Company"
# SENTENCE_REGEX = re.compile(r'^[A-Z][^?!.]*[?.!]$')

def scrape_sentence(sentence, lines, index, phrase):
    if '.' in lines[index] or '!' in lines[index] or '?' in lines[index]:
        return sentence.replace('\n', '').strip()
    sentence = scrape_sentence(lines[index-1] + sentence, lines, index-1, phrase)  # previous line
    sentence = scrape_sentence(sentence + lines[index+1], lines, index+1, phrase)  # following line    
    
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    sentence = sentence[0]  # first occurrence
    print(sentence)
    
    return sentence
    
# ---    
    
with pdfplumber.open('../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if 'and Knowledge of Individuals; Behaviours; Attitudes, Perception ' in lines[i]:
                sentence = scrape_sentence('', lines, i, phrase)  # !
                print(sentence)  # !
            i += 1

Output:

connection and the linkage to the relevant UN’s 17 SDGs.and Leadership. We have long realized and recognized that there

Phrase:

Responsible Care Company

Sentence (spanning multiple lines):

"GPIC is a Responsible Care Company certified for RC 14001 
since July 2010."

PDF (pg. 2).


Please let me know if there is anything else I should add to the post.

I solved this by removing all recursion from scrape_sentence().
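
For anyone landing here, this is a minimal sketch of what a non-recursive version could look like. It is not necessarily the exact code I ended up with. It widens a window of lines backwards and forwards with two plain loops, joins the window, splits it into sentences, and returns the first sentence containing the phrase. The signature (no sentence accumulator), the helper has_terminator(), and sample_lines are assumptions made up for this example. Joining lines with a single space assumes line breaks fall between words.

```python
import re

SENTENCE_END = ('.', '!', '?')


def has_terminator(line):
    """True if the line contains any sentence terminator."""
    return any(ch in line for ch in SENTENCE_END)


def scrape_sentence(lines, index, phrase):
    """Return the sentence around lines[index] that contains phrase, or None."""
    # Walk backwards up to and including the nearest earlier line with a
    # terminator; the wanted sentence may start in the middle of that line.
    start = index
    while start > 0:
        start -= 1
        if has_terminator(lines[start]):
            break

    # Walk forwards up to and including the next line with a terminator.
    end = index
    while end < len(lines) - 1:
        end += 1
        if has_terminator(lines[end]):
            break

    # Join the window, then split after each terminator followed by whitespace.
    text = ' '.join(line.strip() for line in lines[start:end + 1])
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return next((s for s in sentences if phrase in s), None)


# Hypothetical page lines modelled on the example in the question.
sample_lines = [
    "connection and the linkage to the relevant SDGs.",
    "GPIC is a Responsible Care Company certified for RC 14001 ",
    "since July 2010.",
    "We have long realized and recognized that there",
]

print(scrape_sentence(sample_lines, 1, "Responsible Care Company"))
```

In the notebook loop this would be called as scrape_sentence(lines, i, phrase) for each i where phrase is in lines[i]. Both loops are bounded by the page length, so they cannot run forever. The split is naive: abbreviations such as "pg. 2" are treated as sentence ends, and a sentence that crosses a page boundary is cut off.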