Reading in a PDF file and filtering its content using a regex

I am trying to filter a PDF file using a regex so that the output contains only the words the regex matches.

Here is my code:

# FILTER PDF CONTENT FOR PHI USING REGEX

import PyPDF2
import re
# creating a pdf file object 
pdfFileObj = open('pdf.pdf', 'rb')

# creating a pdf reader object 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 


# creating a page object 
pageObj = pdfReader.getPage(0) 

# extracting text from page 
read=pageObj.extractText()

regex2 = re.compile(r'(?:flexibility|Alaska|)')

e=regex2.findall(read)
print(e)

Here is my output:

['', '','', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'flexibility', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''

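If you scroll to the right, you will see that my regex word (flexibility) was found, but why are all those commas there? Any ideas? I am probably missing a small detail, but I can't work out where.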

Contents of read:

The pdf995 suite of products - Pdf995, PdfEdit995, and Signature995 - is a complete solution for your document publishing needs. It provides ease of use, flexibility in format, and industry-standard security- and all at no cost to you. Pdf995 makes it easy and affordable to create professional-quality documents in the popular PDF file format. Its easy-to-use interface helps you to create PDF files by simply selecting the "print" command from any application, creating documents which can be viewed on any computer with a PDF viewer. Pdf995 supports network file saving, fast user switching on XP, Citrix/Terminal Server, custom page sizes and large format printing. Pdf995 is a printer driver that works with any Postscript to PDF converter. The pdf995 printer driver and a free Converter are available for easy download. PdfEdit995 offers a wealth of additional functionality, such as: combining documents into a single PDF; automatic link insertion; hierarchical bookmark insertion; PDF conversion to HTML or DOC (text only); integration with Word toolbar with automatic table of contents and link generation; autoattach to email; stationery and stamping.  Signature995 offers state-of-the-art security and encryption to protect your documents and add digital signatures.  

 The Pdf995 Suite offers the following features, all at no cost: Automatic insertion of embedded links Hierarchical Bookmarks Support for Digital Signatures Support for Triple DES encryption Append and Delete PDF Pages Batch Print from Microsoft Office Asian and Cyrillic fonts Integration with Microsoft Word toolbar PDF Stationery Combining multiple PDF's into a single PDF Three auto-name options to bypass Save As dialog Imposition of Draft/Confidential stamps Support for large format architectural printing Convert PDF to JPEG, TIFF, BMP, PCX formats Convert PDF to HTML and Word DOC conversion Convert PDF to text Automatic Table of Contents generation Support for XP Fast User Switching and multiple user sessions Standard PDF Encryption (restricted printing, modifying, copying text and images) Support for Optimized PDF Support for custom page sizes Option to attach PDFs to email after creation  Automatic text summarization of PDF documents Easy integration with document management and Workflow systems n-Up printing Automatic page numbering Simple Programmers Interface Option to automatically display PDFs after creation Custom resizing of PDF output Configurable Font embedding Support for Citrix/Terminal Server Support for Windows 2003 Server Easy PS to PDF processing Specify PDF document properties Control PDF opening mode Can be configured to add functionality to Acrobat Distiller Free: Creates PDFs without annoying watermarks Free: Fully functional, not a trial and does not expire Over 5 million satisfied customers Over 1000 Enterprise Customers worldwide  Please visit us at www.pdf995.com to learn more.  This document illustrates several features of the Pdf995 Suite of Products. 

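Your pattern ends with a | that has nothing after it. That leaves an empty last alternative, which matches the empty string at every position where neither word matches, so findall returns all those '' entries. Remove it: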

regex2 = re.compile(r'(?:flexibility|Alaska)')

e = regex2.findall(read)
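
To see the effect in isolation, here is a minimal sketch that runs both patterns against a short stand-in string rather than the PDF. The sample text and the variable names are made up for illustration; only the two patterns come from the question and the fix.

```python
import re

# Short stand-in for the extracted page text (illustrative, not the real PDF).
sample = "ease of use, flexibility in format"

# Pattern from the question: the trailing | adds an empty alternative.
with_empty = re.compile(r'(?:flexibility|Alaska|)')
# Fixed pattern: no empty alternative.
without_empty = re.compile(r'(?:flexibility|Alaska)')

matches_with_empty = with_empty.findall(sample)
matches_without_empty = without_empty.findall(sample)

# The first list contains 'flexibility' surrounded by many '' entries,
# one for each position where only the empty alternative could match.
print(matches_with_empty)

# The second list contains only the real match.
print(matches_without_empty)  # → ['flexibility']

# If the pattern cannot be changed, dropping empty strings is a workaround.
filtered = [m for m in matches_with_empty if m]
print(filtered)  # → ['flexibility']
```

Fixing the pattern is the cleaner option; filtering afterwards only hides the empty matches.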

Also, with a pattern this simple, you can drop the non-capturing group:

regex2 = re.compile(r'flexibility|Alaska')
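
As a quick check that the group really is optional here, the sketch below compares the grouped and ungrouped patterns on a made-up sentence that contains both words. The sample string and variable names are assumptions for illustration only.

```python
import re

# Made-up sentence containing both search words (illustrative only).
sample = "It provides flexibility in format. Alaska appears here as a second word."

grouped = re.compile(r'(?:flexibility|Alaska)')
plain = re.compile(r'flexibility|Alaska')

grouped_matches = grouped.findall(sample)
plain_matches = plain.findall(sample)

print(grouped_matches)  # → ['flexibility', 'Alaska']
print(plain_matches)    # → ['flexibility', 'Alaska']
```

The group only becomes necessary once something sits around the alternation, for example word boundaries as in \b(?:flexibility|Alaska)\b, because | has the lowest precedence of the regex operators.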