Getting TypeError: ord() expected string of length 1, but int found error

Question

The code is:

from PyPDF2 import PdfFileReader
with open('HTTP_Book.pdf','rb') as file:
    pdf=PdfFileReader(file)
    pagedd=pdf.getPage(0)
    print(pagedd.extractText())

This code raises the following error:

TypeError: ord() expected string of length 1, but int found

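I searched the internet and found Troubleshooting "TypeError: ord() expected string of length 1, but int found", but it did not help much. I understand what this error means in general, but I am not sure how it relates to this case.

I tried a different PDF file and it worked fine. So where does the problem lie: with the PDF file, or with PyPDF2 being unable to handle it? From the documentation, I know that this method is not very reliable: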

This works well for some PDF files, but poorly for others, depending on the generator used

How should this be handled?

Traceback:

Traceback (most recent call last):
  File "pdf_reader.py", line 71, in <module>
    print(pagedd.extractText())
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2595, in extractText
    content = ContentStream(content, self.pdf)
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\pdf.py", line 2673, in __init__
    stream = BytesIO(b_(stream.getData()))
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\generic.py", line 841, in getData
    decoded._data = filters.decodeStreamData(self)
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 350, in decodeStreamData
    data = LZWDecode.decode(data, stream.get("/DecodeParms"))
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 255, in decode
    return LZWDecode.decoder(data).decode()
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 228, in decode
    cW = self.nextCode();
  File "C:\Users\Jeet\AppData\Local\Programs\Python\Python37\lib\site-packages\PyPDF2\filters.py", line 205, in nextCode
    nextbits=ord(self.data[self.bytepos])
TypeError: ord() expected string of length 1, but int found

Answer 1

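I figured it out. This is just a limitation of PyPDF2. I used tika and BeautifulSoup to parse the file and extract the text instead, and it works fine, although it takes a bit more work.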

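from tika import parser
from bs4 import BeautifulSoup

# Ask Tika for XHTML output so that each page is wrapped in its own element
raw = parser.from_file('HTTP_Book.pdf', xmlContent=True)['content']
data = BeautifulSoup(raw, 'lxml')
message = data.find(class_='page')  # first element with class "page", i.e. the first page
print(message.text)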
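
As for why the error shows up at all: the last frame of the traceback calls ord(self.data[self.bytepos]) inside the LZW decoder. If self.data is a bytes object, which seems likely on Python 3.7, then indexing it already yields an int, and ord() refuses an int. That would also explain why other PDFs work: only files whose streams use the LZW filter reach this code path. The snippet below is a minimal sketch of that behaviour, not PyPDF2's actual code; the sample bytes and the byte_value helper are made up for illustration.

```python
# Two arbitrary bytes standing in for LZW-compressed stream data (made up).
data = b"\x80\x0b"

# On Python 3, indexing a bytes object returns an int, not a 1-character string.
first = data[0]

# Calling ord() on that int fails the same way as in the traceback.
try:
    ord(first)
    error = None
except TypeError as exc:
    error = str(exc)

def byte_value(b):
    """Hypothetical helper: integer value of a single byte on Python 2 or 3."""
    # Python 3 indexing gives an int already; a length-1 bytes/str needs ord().
    return b if isinstance(b, int) else ord(b)

print(type(first).__name__)
print(error)
print(byte_value(data[0]), byte_value(data[0:1]))
```

So the file itself is probably not broken; the decoder was evidently written with Python 2 strings in mind. Whether a newer release of the library handles this is worth checking before switching tools.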

python

python-3.x

pypdf2