pdfplumber | Extract text from dynamic column layouts

Question

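Attempted solution at the bottom of the post.

I have near-working code that extracts the sentence containing a phrase, even when that sentence spans multiple lines.

However, some pages have columns, so the output for those pages is incorrect: separate pieces of text are wrongly merged into one bad sentence.

This problem has been addressed in the following posts: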

Solution 1

Solution 2

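Problem:

How do I write an "if-condition" that checks whether a page has columns?

Pages may have no columns,

Pages may have more than 2 columns.

Pages may also have headers and footers (which can be omitted).

Example .pdf with a dynamic text layout: PDF (pg. 2).

Jupyter Notebook: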

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber

# ---

def scrape_sentence(phrase, lines, index):
    # -- Gather the sentence that 'phrase' occurs in --
    sentence = lines[index]
    print("-- sentence --", sentence)
    print("len(lines)", len(lines))

    # Previous lines: prepend until a line contains a sentence terminator
    pre_i, flag = index, 0
    while flag == 0:
        pre_i -= 1
        if pre_i < 0:  # was '<= 0', which never looked at the first line
            break

        sentence = lines[pre_i] + sentence

        if '.' in lines[pre_i] or '!' in lines[pre_i] or '?' in lines[pre_i] or ' • ' in lines[pre_i]:
            flag = 1  # was 'flag == 1', a comparison that never stopped the loop

    print("\n", sentence)

    # Following lines: append until a line contains a sentence terminator
    post_i, flag = index, 0
    while flag == 0:
        post_i += 1
        if post_i >= len(lines):
            break

        sentence = sentence + lines[post_i]

        # the last check used lines[pre_i] by mistake
        if '.' in lines[post_i] or '!' in lines[post_i] or '?' in lines[post_i] or ' • ' in lines[post_i]:
            flag = 1

    print("\n", sentence)

    # -- Extract --
    sentence = sentence.replace('!', '.')
    sentence = sentence.replace('?', '.')
    sentence = sentence.split('.')
    sentence = [s for s in sentence if phrase in s]
    print(sentence)
    sentence = sentence[0].replace('\n', '').strip()  # first occurrence
    print(sentence)
    return sentence

# ---

phrase = 'Gulf Petrochemical Industries Company'

with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        text = page.extract_text()
        if text is None:
            continue
        lines = text.split('\n')
        i = 0
        sentence = ''
        while i < len(lines):
            if phrase in lines[i]:
                sentence = scrape_sentence(phrase, lines, i)
            i += 1

Example incorrect output:

-- sentence -- being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of
len(lines) 47

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of

 Company (GPIC)gulf petrochemical industries company (gpic) is a leading joint venture setup and owned by the government of the kingdom of bahrain, saudi basic industries corporation (sabic), kingdom of saudi arabia and petrochemical industries company (pic), kuwait. gpic was set up for the purposes of manufacturing fertilizers and petrochemicals. being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption. represented by natural gas purchases, empowering bahraini nationals through training & employment, utilisation of local contractors and suppliers, energy consumption and other financial, commercial, environmental and social activities that arise as a part of our core operations within the kingdom.GPIC becomes an organizational stakeholder of Global Reporting for the purpose of clarity throughout this report, Initiative ( GRI) in 2014. By supporting GRI, Organizational ‘gpic’, ’we’ ‘us’, and ‘our’ refer to the gulf Stakeholders (OS) like GPIC, demonstrate their commitment to transparency, accountability and sustainability to a worldwide petrochemical industries company; ‘sabic’ refers to network of multi-stakeholders.the saudi basic industries corporation; ‘pic’ refers to the petrochemical industries company, kuwait; ‘nogaholding’ refers to the oil and gas holding company, kingdom of bahrain; and ‘board’ refers to our board of directors represented by a group formed by nogaholding, sabic and pic.the oil and gas holding company (nogaholding) is GPIC is a Responsible Care Company certified for RC 14001 since July 2010. We are committed to the safe, ethical and the business and investment arm of noga (national environmentally sound management of the petrochemicals oil and gas authority) and steward of the bahrain and fertilizers we make and export. Stakeholders’ well-being is government’s investment in the bahrain petroleum always a key priority at GPIC.company (bapco), the bahrain national gas company (banagas), the bahrain national gas expansion company (bngec), the bahrain aviation fuelling company (bafco), the bahrain lube base oil company, the gulf petrochemical industries company (gpic), and tatweer petroleum.GPIC SuStaInabIlIty RePoRt 2016 01ii GPIC SuStaInabIlIty RePoRt 2016 GPIC SuStaInabIlIty RePoRt 2016 01
[' being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption']
being a major manufacturer within the kingdom of In 2012, Gulf Petrochemical Industries Company becomes part of the global transformation for a sustainable future by committing to bahrain, gpic is also a proactive stakeholder within the United Nations Global Compact’s ten principles in the realms the kingdom and the region with our activities being of Human Rights, Labour, Environment and Anti-Corruption
...

Attempted minimal solution: this splits the text into 2 columns, regardless of whether the page actually has 2.

# pip install PyPDF2
# pip install pdfplumber

# ---

import pdfplumber
import decimal

# ---

# Note: the Decimal arithmetic below assumes page.width and page.height are
# Decimal values; if your pdfplumber version returns plain numbers, drop the
# decimal.Decimal wrappers.
with pdfplumber.open('GPIC_Sustainability_Report_2016-v9_(lr).pdf') as opened_pdf:
    for page in opened_pdf.pages:
        # Left half of the page, with the bottom 10% cut off
        left = page.crop((0, 0, decimal.Decimal(0.5) * page.width, decimal.Decimal(0.9) * page.height))
        # Right half of the page, full height
        right = page.crop((decimal.Decimal(0.5) * page.width, 0, page.width, page.height))

        l_text = left.extract_text()
        r_text = right.extract_text()
        print("\n -- l_text --", l_text)
        print("\n -- r_text --", r_text)
        text = str(l_text) + " " + str(r_text)
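One possible way to avoid a fixed 50/50 split is to look for vertical gutters, that is, x-ranges that no word overlaps, and cut the page there. The sketch below is only an illustration under stated assumptions: `column_boxes`, `min_gap` and `margin` are hypothetical names and values, not part of pdfplumber, and the input is assumed to be a list of word dicts with `x0`, `x1`, `top` and `bottom` keys, the shape that `page.extract_words()` returns. It runs on synthetic data so it can be tried without a PDF.

```python
def column_boxes(words, page_width, page_height, min_gap=15.0, margin=0.08):
    """Return one (x0, top, x1, bottom) box per detected column, left to right.

    `words` is a list of dicts with "x0", "x1", "top" and "bottom" keys.
    `min_gap` (points) and `margin` (fraction of page height) are
    hypothetical tuning values, not library defaults.
    """
    width, height = float(page_width), float(page_height)
    top_limit, bottom_limit = height * margin, height * (1 - margin)

    # Ignore the header and footer bands so a full-width running title
    # does not bridge the gutter between columns.
    spans = sorted(
        (float(w["x0"]), float(w["x1"]))
        for w in words
        if float(w["top"]) >= top_limit and float(w["bottom"]) <= bottom_limit
    )

    # Sweep left to right; a jump of at least min_gap between the text seen
    # so far and the next word's left edge is treated as a gutter.
    gutters = []
    if spans:
        covered_until = spans[0][1]
        for x0, x1 in spans[1:]:
            if x0 - covered_until >= min_gap:
                gutters.append((covered_until, x0))
            covered_until = max(covered_until, x1)

    # Cut the page at the middle of each gutter.
    edges = [0.0] + [(a + b) / 2 for a, b in gutters] + [width]
    return [(edges[i], top_limit, edges[i + 1], bottom_limit)
            for i in range(len(edges) - 1)]


# Synthetic example: two columns plus a full-width header on a 600 x 800 page.
sample_words = [
    {"x0": 50, "x1": 500, "top": 10, "bottom": 20},     # header, filtered out
    {"x0": 50, "x1": 120, "top": 100, "bottom": 110},   # left column
    {"x0": 125, "x1": 250, "top": 100, "bottom": 110},  # left column
    {"x0": 300, "x1": 400, "top": 100, "bottom": 110},  # right column
    {"x0": 405, "x1": 500, "top": 100, "bottom": 110},  # right column
]
boxes = column_boxes(sample_words, 600, 800)
print(len(boxes), "column(s)")
```

With a real page, `len(boxes) > 1` could serve as the "if-condition", and each box could be passed to `page.crop(box)` before calling `extract_text()` on the crops in order. This assumes the page's bounding box starts at (0, 0); a full-width heading inside the body area would hide the gutter, so the thresholds would likely need tuning per document.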

Please let me know if there is anything else I should clarify.

Answer 1

This answer lets you scrape the text in the intended order.

From the Towards Data Science article PDF Text Extraction in Python:

Compared with PyPDF2, PDFMiner’s scope is much more limited, it really focuses only on extracting the text from the source information of a pdf file.

from io import StringIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_string(file_path):
    output_string = StringIO()
    with open(file_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    return output_string.getvalue()

file_path = ''  # ! set this to the path of your PDF
text = convert_pdf_to_string(file_path)
print(text)

The extracted text can then be cleaned up afterwards.
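As a rough illustration of what such a clean-up step might look like, the sketch below rejoins wrapped lines and then picks out the sentences that contain a phrase. `clean_text` and `sentences_with` are hypothetical helper names, and the sentence splitting is deliberately naive: abbreviations such as "pg. 2" would be split incorrectly.

```python
import re


def clean_text(text):
    """Flatten extracted text into a single line of normalised whitespace."""
    # Rejoin words hyphenated across a line break ("trans-\nformation").
    # Caveat: this also fuses genuine compounds such as "well-\nbeing".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse newlines, form feeds and repeated spaces into one space.
    return re.sub(r"\s+", " ", text).strip()


def sentences_with(phrase, text):
    """Return every sentence in `text` that contains `phrase`."""
    cleaned = clean_text(text)
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    return [s for s in sentences if phrase in s]


sample = (
    "GPIC was set up in Bahrain.\n"
    "In 2012, Gulf Petrochemical\n"
    "Industries Company becomes part of the\n"
    "global trans-\n"
    "formation. Another sentence!"
)
matches = sentences_with("Gulf Petrochemical Industries Company", sample)
print(matches)
```

In practice `sample` would be replaced by the string returned from `convert_pdf_to_string`.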

pdfplumber | Extract text from dynamic column layouts

python

if-statement

text-extraction

information-extraction

pdfplumber