How to extract multiple tables from one PDF file using Pandas and tabula-py

Question

Can anyone help me extract multiple tables from ONE PDF file? I have 5 pages, and every page has one table with the same header columns, for example:

Example of the table on each page:

student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4

I want to extract all of these tables into one DataFrame. First I tried:

df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)

But I got a messy output, a list of tables that looks like this:

[student  Score Rang
Alex     50     23
Julia    80     12
Mariana  94     4 ,student  Score Rang
Maxim    43     34
Nourah   93     5]

So I edited my code like this:

    import pandas as pd
    import tabula

    file_path = "filePath.pdf"

    # read each page of the file separately
    df1 = tabula.read_pdf(file_path,pages=1,multiple_tables=True)
    df2 = tabula.read_pdf(file_path,pages=2,multiple_tables=True)
    df3 = tabula.read_pdf(file_path,pages=3,multiple_tables=True)
    df4 = tabula.read_pdf(file_path,pages=4,multiple_tables=True)
    df5 = tabula.read_pdf(file_path,pages=5,multiple_tables=True)

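This gives me a DataFrame for each table, but I don't know how to combine them back into a single DataFrame. Is there another solution that also avoids repeating these lines of code?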

Answer 1

According to the tabula-py documentation, read_pdf returns a list of DataFrames when the multiple_tables=True option is passed.

So you can use pandas.concat on its output to concatenate the DataFrames:

df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
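As a minimal sketch of the concatenation step, assuming every page's table has the same columns: the two DataFrames below are made-up stand-ins for the list that read_pdf returns (so the example runs without a PDF or Java), and combine_tables is a hypothetical helper name. Passing ignore_index=True is an addition that is not in the line above; it renumbers the rows of the result instead of repeating each page's own index.

```python
import pandas as pd


def combine_tables(tables):
    """Concatenate a list of per-page DataFrames into a single DataFrame.

    ignore_index=True renumbers the rows from 0 so that each page's
    own 0-based index is not repeated in the result.
    """
    return pd.concat(tables, ignore_index=True)


# Stand-ins (assumed data) for the list returned by
# tabula.read_pdf(file_path, pages='all', multiple_tables=True)
page1 = pd.DataFrame(
    {"student": ["Alex", "Julia", "Mariana"], "Score": [50, 80, 94], "Rang": [23, 12, 4]}
)
page2 = pd.DataFrame(
    {"student": ["Maxim", "Nourah"], "Score": [43, 93], "Rang": [34, 5]}
)

combined = combine_tables([page1, page2])
print(combined)
```

With a real file, the list returned by read_pdf could be passed to the same helper in place of [page1, page2].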

