Tabula: FileNotFoundError: [Errno 2] (but file path is correct)
Question:
import tabula as tb
import pandas as pd
other = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
dfs = tb.read_pdf(other, stream=True) #this works
file="D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
tables = tb.read_pdf(file, pages = "all", multiple_tables = True)
tables
Output:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-29-c598474e8fa3> in <module>
6
7 file="D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
----> 8 tables = tb.read_pdf(file, pages = "all", multiple_tables = True)
9 tables
~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
312
313 if not os.path.exists(path):
--> 314 raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
315
316 if os.path.getsize(path) == 0:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\Favorites\x01. Programming\Projects\cell penetrating peptide supplemental.pdf'
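It seems that nobody else who ran into this problem got it resolved.
The first suggestion I followed was to check that the file actually exists.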
import os
file=r"D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
print(os.path.isfile(file))
print(os.path.exists(file))
print(os.path.getsize(file) == 0)
Output:
True
True
False
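Why does it raise an error? It should only be raised when os.path.exists(file) is False.
I tried a file from the Internet and it worked fine. The file I am trying to read has no URL. I cannot view it in my browser; I only have the option to download it. Otherwise I would try passing its URL to the function.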
Update:
I tried the suggested solution
import tabula as tb
import pandas as pd
tables = tb.read_pdf(r"D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf", pages = "all", multiple_tables = True)
tables
and got this:
Got stderr: Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font DCUQIG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font DCUQIG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font DREOWG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font DREOWG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UCENHU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font UCENHU+CambriaMath
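Answer:
The problem is that tabula-py has a localize_file function that is called in read_pdf. localize_file calls os.path.expanduser to expand the path. For example, on Unix-like systems, "~" is an alias for the user's home directory, so os.path.expanduser performs the following expansion on Mac OS X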
>>> os.path.expanduser("~/Documents")
'/Users/username/Documents'
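The tilde expansion above can be reproduced with a small sketch. The home directory value is an assumption, set through environment variables only so that the result is predictable on any machine; the relative path is also made up for illustration.

```python
import os

# Assumption: a made-up home directory, set only so the result is predictable.
os.environ["HOME"] = "/Users/username"         # consulted on Unix-like systems
os.environ["USERPROFILE"] = "/Users/username"  # consulted on Windows

# A leading "~" is replaced by the home directory.
expanded = os.path.expanduser("~/Documents")

# A path without a leading "~" is returned as it was given.
unchanged = os.path.expanduser("Documents/report.pdf")

print(expanded)
print(unchanged)
```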
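Unfortunately, the backslash causes a second problem. In an ordinary (non-raw) Python string literal, a \ followed by digits is an octal escape sequence. It is resolved when the literal is parsed, so the path has already been altered by the time os.path.expanduser (and the os.fspath call inside it) receives it. So if you run
>>> os.path.expanduser("\125")
'U'
>>> os.fspath("\125")
'U'
In your case, the \1 in the path was turned into \x01, so Windows cannot find such a directory. To keep your path unchanged, pass it as a raw string, i.e. put an r in front of it, like this
>>> os.path.expanduser(r"\125")
'\\125'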
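To check where the \x01 comes from, here is a minimal sketch comparing an ordinary literal, a raw literal, and a literal with a doubled backslash. The path fragment is shortened from the one in the question and does not need to exist on disk. It suggests that the control character is already in the string once Python has parsed the literal, and that os.path.expanduser returns the string unchanged when there is no leading "~".

```python
import os

# Illustrative fragment of a Windows-style path; it does not need to exist on disk.
plain = "Favorites\1. Programming"     # ordinary literal: \1 is an octal escape
raw = r"Favorites\1. Programming"      # raw literal: the backslash is kept
escaped = "Favorites\\1. Programming"  # doubled backslash: same text as the raw literal

# repr() makes an otherwise invisible control character visible.
print(repr(plain))
print(repr(raw))

# The control character is present before any function is called.
assert plain == "Favorites\x01. Programming"
assert raw == escaped

# With no leading "~", expanduser returns the string it was given.
assert os.path.expanduser(raw) == raw
assert os.path.expanduser(plain) == plain
```

Besides the r prefix, doubling each backslash or writing the path with forward slashes (for example "D:/Favorites/1. Programming/...") should also avoid the escape, since Python on Windows accepts forward slashes as path separators.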
References:
tabula's read_pdf line 311 localize_file is invoked
tabula's localize_file line 72 os.path.expanduser is invoked