How to resize Dataframe in python to include first row in Source/PDF

Question

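Please help.

When I read a PDF into a Python DataFrame, the returned data is missing the first row of the table, as shown below. This may be because of how the PDF is generated at the source. (Image: sample data in the DataFrame.) Is there a way to resize or restructure the DataFrame so that the first row of the table is included?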

import tabula
import pandas as pd

file = "sample.pdf"
tables = tabula.read_pdf(file, pages=1, multiple_tables=True)

df = pd.DataFrame(tables[0])
df = df.reset_index()

for index, row in df.iterrows():
    print(row[0], row[1], row[2], row[3], row[4])

Answer 1

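Judging from the output image, tabula has treated the first row of your data as the table's header. This is probably because the table has no header row, so tabula used the first row as the column names.

The simplest way to stop tabula from turning the first row into column headers is to use its pandas_options parameter.

Add the parameter like this: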

tables = tabula.read_pdf(file, pages=1, multiple_tables=True, pandas_options={'header':None})

This should stop tabula from turning the first data row into your column headers.
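
If re-reading the PDF is not convenient, roughly the same result can be reached on the pandas side by moving the column labels back into the data. The following is only a minimal sketch: header_to_first_row is a hypothetical helper name, and the small DataFrame is an invented stand-in for tables[0], not the data from the question.

```python
import pandas as pd


def header_to_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame whose first row holds df's column labels."""
    # The current column labels are really the first data row.
    first_row = pd.DataFrame([df.columns.tolist()])
    # Relabel the body positionally (0..n-1) so it lines up with first_row.
    body = df.copy()
    body.columns = range(df.shape[1])
    # Stack the recovered row on top and renumber the index from 0.
    return pd.concat([first_row, body], ignore_index=True)


# Assumed stand-in for tables[0]: the row ("Alice", "30") was taken as the header.
df = pd.DataFrame({"Alice": ["Bob", "Carol"], "30": ["25", "41"]})
fixed = header_to_first_row(df)
print(fixed)
```

Note that values recovered from the column labels come back as strings, so a numeric column may need converting afterwards. Passing pandas_options when reading avoids that extra step.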


python

resize

dataframe