Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

I am writing some automation scripts to process Excel files in Python, some of which are in XLS format. Here is a code snippet of my attempt with Pandas:

df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])

contents is the file contents pulled from an AWS S3 bucket. When this line runs I get [ERROR] ValueError: File is not a recognized excel file.

While troubleshooting this, I tried accessing the spreadsheet directly with xlrd:

book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))

This runs without errors, so xlrd seems to recognize it as an Excel file, just not when Pandas asks it to.

Does anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?

pd.read_excel can take a book...

import pandas as pd
import xlrd

book = xlrd.open_workbook(filename='./file_check/file.xls')

df = pd.read_excel(book, skiprows=5)

print(df)

   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no
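
If you would rather build the frame from the xlrd sheet yourself (the second part of the question), a rough sketch could look like the following. It only relies on `nrows` and `row_values()`, which xlrd sheets provide. `sheet_to_records` and `DemoSheet` are hypothetical names made up for illustration, and `DemoSheet` is just a stand-in so the snippet runs without a real .xls file. As with `pd.read_excel`, the first row after the skipped ones is treated as the header.

```python
class DemoSheet:
    """Hypothetical stand-in exposing the two xlrd sheet members used below."""

    def __init__(self, rows):
        self._rows = rows
        self.nrows = len(rows)

    def row_values(self, rowx):
        return list(self._rows[rowx])


def sheet_to_records(sheet, skiprows=0, names=None):
    """Turn an xlrd-style sheet into a list of {column: value} dicts.

    The first row after `skiprows` is consumed as the header; if `names`
    is given, it replaces that header.
    """
    rows = [sheet.row_values(i) for i in range(skiprows, sheet.nrows)]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    columns = names if names is not None else header
    return [dict(zip(columns, row)) for row in body]


# Demo data: one junk row on top, then a header row and two data rows.
demo = DemoSheet([
    ["junk"],
    ["some", "column", "headers"],
    [1.0, "some", "foo"],
    [2.0, "strings", "bar"],
])
records = sheet_to_records(demo, skiprows=1)
print(records)
```

With a real workbook, something like `sheet = book.sheet_by_index(0)` followed by `pd.DataFrame(sheet_to_records(sheet, skiprows=5))` should give you a frame. Keep in mind that xlrd hands back numbers as floats and dates as serial numbers, so some conversion may still be needed.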

I'll include the code below in case you want to check/handle Excel file types; it may be helpful. Maybe you can adapt it to your needs.

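The code loops over a local folder and shows each file and its extension, then uses python-magic to look deeper. It also has a column showing the guess from mimetypes, but that is not as good. In the resulting frame, some .xls files turn out not to be what the extension says, and one .txt is actually an Excel file.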

import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic

path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")

data = []

for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])

df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])

# del df['magic.from_file(file, mime=True)']

df

From there you can filter the files based on their type:

xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

xls_file_format = 'application/vnd.ms-excel'

for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        #  DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        #  DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue

dfs = []

for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
    (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)
    
print('\nHow many frames did we get from seven files? ', len(dfs))

Output:

xlsx
xls
xls
xlsx

How many frames did we get from seven files?  4
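
When the content arrives as bytes (as with the S3 download in the question) and you just want a quick hint about what it is, a check on the leading bytes can be sketched with the standard library alone. The function name `sniff_container` is hypothetical. The two signatures are the well-known OLE2 compound file header (used by legacy .xls) and the ZIP local file header (used by .xlsx). This is only a hint, not a full check: other Office formats share the same containers.

```python
# Leading bytes of the two container formats Excel files normally use.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (also .doc, .ppt)
ZIP_MAGIC = b"PK\x03\x04"                       # .xlsx (also .docx, plain .zip)


def sniff_container(data: bytes) -> str:
    """Return 'ole2', 'zip', or 'unknown' based on the leading bytes."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


# Tiny synthetic inputs, not real workbooks.
print(sniff_container(OLE2_MAGIC + b"\x00" * 8))
print(sniff_container(ZIP_MAGIC + b"rest of archive"))
print(sniff_container(b"<html><body>not excel</body></html>"))
```

If python-magic is available, `magic.from_buffer(contents, mime=True)` gives a finer-grained answer on in-memory bytes, in the same way `magic.from_file` does for paths above.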