Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own

Question

I am writing some automation scripts to process Excel files in Python, some of which are in XLS format. Here is a code snippet of my attempt with Pandas:

df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])

contents is the file content pulled from an AWS S3 bucket. When this line runs, I get [ERROR] ValueError: File is not a recognized excel file.

While troubleshooting this, I tried accessing the spreadsheet directly with xlrd:

book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))

This runs without errors, so xlrd seems to recognize it as an Excel file, just not when Pandas asks it to do so.

Does anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?

Answer 1

Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?

pd.read_excel can take a book...

import pandas as pd
import xlrd

book = xlrd.open_workbook(filename='./file_check/file.xls')

df = pd.read_excel(book, skiprows=5)

print(df)

   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no
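For the in-memory case from the question, where `contents` holds bytes rather than a path, the same idea might look like the sketch below. `frame_from_xls_bytes` and `sheet_rows` are hypothetical helper names, not part of pandas or xlrd. The first assumes xlrd can parse the bytes and that your pandas version accepts an `xlrd.Book`; the second is a manual fallback that reads rows straight off an xlrd sheet.

```python
def frame_from_xls_bytes(contents, **read_excel_kwargs):
    """Open raw .xls bytes with xlrd, then hand the resulting Book to pandas."""
    # Imported inside the function so sheet_rows() below stays usable
    # even where pandas or xlrd is not installed.
    import pandas as pd
    import xlrd

    book = xlrd.open_workbook(file_contents=contents)
    return pd.read_excel(book, **read_excel_kwargs)


def sheet_rows(sheet, skiprows=0):
    """Return the rows of an xlrd sheet as plain lists, dropping leading rows."""
    # Relies only on Sheet.nrows and Sheet.row_values(), so any object
    # exposing those two members will work.
    return [sheet.row_values(i) for i in range(skiprows, sheet.nrows)]
```

With the question's data this could be called as `frame_from_xls_bytes(contents, skiprows=5, names=['some', 'column', 'headers'])`. If pandas still rejects the book, `pd.DataFrame(sheet_rows(book.sheet_by_index(0), skiprows=5), columns=['some', 'column', 'headers'])` builds the frame by hand. In that case `skiprows` drops exactly that many rows and no header row is consumed, and xlrd returns date cells as floats, so some post-processing may be needed.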

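I'll include the code below, which may help if you want to check/handle Excel file types. Perhaps you can adapt it to your needs.

The code loops through a local folder and shows each file and its extension, then uses python-magic to dig deeper. It also has a column showing the guess from mimetypes, but that is not as good. Looking at the resulting frame, you can see that some .xls files are not what the extension says. Also, one .txt file is actually an Excel file.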

import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic

path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")

data = []

for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])

df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])

# del df['magic.from_file(file, mime=True)']

df

From there you can filter the files based on their type:

xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

xls_file_format = 'application/vnd.ms-excel'

for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        #  DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        #  DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue

dfs = []

for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
    (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)
    
print('\nHow many frames did we get from seven files? ', len(dfs))

Output:

xlsx
xls
xls
xlsx

How many frames did we get from seven files?  4
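When the file content is already in memory, as with the bytes pulled from S3 in the question, a rough check of the leading bytes can stand in for the extension. This is only a sketch using the standard library: `sniff_excel_format` is a hypothetical helper, and it recognises just two well-known signatures (the OLE2 compound-file header used by most .xls files, and the ZIP header used by .xlsx). Other files can share these headers (any ZIP archive, older Office documents), and some old .xls variants that xlrd can open carry neither, so treat the result as a hint rather than a verdict.

```python
# Leading bytes ("magic numbers") of the two common Excel containers.
OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound file: typical .xls
ZIP_SIGNATURE = b"PK\x03\x04"  # ZIP container: .xlsx, but also any other ZIP


def sniff_excel_format(contents):
    """Guess 'xls', 'xlsx' or None from the first bytes of a file."""
    if contents.startswith(OLE2_SIGNATURE):
        return "xls"
    if contents.startswith(ZIP_SIGNATURE):
        return "xlsx"
    return None  # unknown: could be CSV/HTML saved as .xls, or an older format
```

A result of `None` for a file that xlrd opens happily would suggest the file is in one of the older layouts, which may be why a stricter signature check rejects it.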

