Not receiving correct pattern from regex on PyPDF2 for a PDF

Question

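I want to extract all instances of a particular word, e.g. 'math', from a PDF. So far I am using PyPDF2 to convert the PDF to text and then running a regular expression on that text to find what I want. I am testing this on an example PDF.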

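When I run my code, instead of returning the matches for the regex pattern 'math', it prints the string of the entire page, one character at a time. Please help, thank you.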

#First Change Current Working Directory to desktop

import os
os.chdir('/Users/Hussein/Desktop')         #File is located on Desktop


#Second is the PyPDF2

import PyPDF2
pdfFileObj=open('TEST1.pdf','rb')          #Opening the File
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
pageObj=pdfReader.getPage(3)               #For the test I only need page 3
TextVersion=pageObj.extractText()
print(TextVersion)



#Third is the Regular Expression

import re
match=re.findall(r'math',TextVersion)
for match in TextVersion:
      print(match)

Instead of all instances of 'math', I get:

I
n
t
r
o
d
u
c
t
i
o
n

and so on.

Answer 1

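You are actually iterating over the value of the TextVersion variable, which is a string, so the loop yields one character at a time. You have to iterate over the list returned by re.findall instead.

So your for loop must be: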

match=re.findall(r'math',TextVersion)
for i in match:
    print(i)
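
To make the difference concrete, here is a minimal sketch that runs without a PDF. The sample string text_version is a hypothetical stand-in for whatever extractText() returns on your page; only the two loops matter.

```python
import re

# Hypothetical sample text standing in for the output of extractText()
text_version = "Introduction to math. Applied math and pure math."

# Iterating over a str yields single characters, which is what the question saw
chars = [ch for ch in text_version][:5]
print(chars)  # ['I', 'n', 't', 'r', 'o']

# Iterating over the list returned by re.findall yields the matches
matches = re.findall(r'math', text_version)
for m in matches:
    print(m)  # prints 'math' three times, once per line
```

Note that in the question the loop variable was also named match, so the list returned by re.findall was overwritten on the first iteration. Using a different name for the loop variable avoids that.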

Answer 2

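The TextVersion variable holds the text. When you use it in a for loop, it gives you the text one character at a time, as you have seen. The findall function returns a list of matches, so if you use that in your for loop instead, you get each matched word (in your test they will all be the same).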

import re

for match in re.findall(r'math',TextVersion):
    print(match)

The result returned from findall is something like:

["math", "math", "math"]

So your output will be:

math
math
math
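
One thing to watch for: the pattern r'math' also matches inside longer words, and it is case-sensitive. If you only want the standalone word, a possible variation is to add word boundaries and ignore case. This is a sketch on a made-up sample string, not the text of your PDF:

```python
import re

# Hypothetical sample text; the real input would be the extracted page text
sample = "Math is fun. Aftermath of the math exam: mathematics."

# Plain substring pattern: also hits 'Aftermath' and 'mathematics', misses 'Math'
loose_matches = re.findall(r'math', sample)
print(len(loose_matches))  # 3

# \b marks a word boundary; re.IGNORECASE lets 'Math' match as well
whole_word_matches = re.findall(r'\bmath\b', sample, flags=re.IGNORECASE)
print(whole_word_matches)  # ['Math', 'math']
```

Whether you want the loose or the whole-word behaviour depends on what counts as an "instance" for your use case.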

Tags: python, regex, pdf, pypdf, python-3.x