Read XLS file with Pandas & xlrd returns error; xlrd opens file on its own
I'm writing some automation scripts to process Excel files in Python, some of which are in XLS format. Here's a code snippet of my attempt with Pandas:
df = pd.read_excel(contents, engine='xlrd', skiprows=5, names=['some', 'column', 'headers'])
contents is the file contents pulled from an AWS S3 bucket. When this line runs, I get [ERROR] ValueError: File is not a recognized excel file.
While troubleshooting this, I tried accessing the spreadsheet with xlrd directly:
book = xlrd.open_workbook(file_contents=contents)
print("Number of worksheets is {}".format(book.nsheets))
print("Worksheet names: {}".format(book.sheet_names()))
This runs without errors, so xlrd seems to recognize it as an Excel file, just not when Pandas asks it to.
Does anyone know why Pandas won't read the file with xlrd as the engine? Or can someone help me take the sheet from xlrd and convert it into a Pandas dataframe?
pd.read_excel can take a book...
import pandas as pd
import xlrd
book = xlrd.open_workbook(filename='./file_check/file.xls')
df = pd.read_excel(book, skiprows=5)
print(df)
   some   column headers
0     1     some     foo
1     2  strings     bar
2     3     here     yes
3     4      too      no
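If you would rather build the frame from the xlrd sheet yourself, as the question also asks, the rows can be pulled out by hand. This is only a minimal sketch: sheet_to_records and its header_row parameter are hypothetical names, not part of xlrd or pandas, and the function relies only on the sheet's nrows attribute and row_values method. Here header_row=5 plays roughly the role of skiprows=5 above.

```python
def sheet_to_records(sheet, header_row=0, names=None):
    """Return one dict per data row of an xlrd-style sheet.

    header_row is the zero-based index of the row holding the column
    titles; every row above it is ignored. If names is given, it is
    used for the keys instead of the titles found in the sheet.
    """
    header = names if names is not None else sheet.row_values(header_row)
    records = []
    for row_index in range(header_row + 1, sheet.nrows):
        # zip() pairs each column title with the cell in the same position
        records.append(dict(zip(header, sheet.row_values(row_index))))
    return records


# Possible usage, assuming `contents` holds the bytes of an .xls file:
#   import pandas as pd
#   import xlrd
#   book = xlrd.open_workbook(file_contents=contents)
#   sheet = book.sheet_by_index(0)
#   df = pd.DataFrame(sheet_to_records(sheet, header_row=5))
```

Note that xlrd returns numeric cells as floats, so a column of whole numbers may need converting afterwards. Passing the book straight to pd.read_excel, as shown above, is the shorter route when it works for you.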
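I'll include the code below, which may help if you want to check/handle Excel file types. Maybe you can adapt it to your needs.
The code loops through a local folder and shows each file and its extension, then uses python-magic to dig deeper. It also has a column showing the guess from mimetypes, but that one is not as reliable. Zoom in on the image of the frame to see that some .xls files are not what the extension says. Also, one .txt file is actually an Excel file.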
import pandas as pd
import glob
import mimetypes
import os
# https://pypi.org/project/python-magic/
import magic
path = r'./file_check' # use your path
all_files = glob.glob(path + "/*.*")
data = []
for file in all_files:
    name, extension = os.path.splitext(file)
    data.append([file, extension, magic.from_file(file, mime=True), mimetypes.guess_type(file)[0]])
df = pd.DataFrame(data, columns=['Path', 'Extension', 'magic.from_file(file, mime=True)', 'mimetypes.guess_type'])
# del df['magic.from_file(file, mime=True)']
df
From there you can filter the files based on their type:
xlsx_file_format = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
xls_file_format = 'application/vnd.ms-excel'
for file in all_files:
    if magic.from_file(file, mime=True) == xlsx_file_format:
        print('xlsx')
        # DO SOMETHING SPECIAL WITH XLSX FILES
    elif magic.from_file(file, mime=True) == xls_file_format:
        print('xls')
        # DO SOMETHING SPECIAL WITH XLS FILES
    else:
        continue

dfs = []
for file in all_files:
    if (magic.from_file(file, mime=True) == xlsx_file_format) or \
       (magic.from_file(file, mime=True) == xls_file_format):
        # who cares, it all works with this for the demo...
        df = pd.read_excel(file, skiprows=5, names=['some', 'column', 'headers'])
        dfs.append(df)

print('\nHow many frames did we get from seven files? ', len(dfs))
Output:
xlsx
xls
xls
xlsx
How many frames did we get from seven files? 4
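When the file arrives as bytes rather than as a path, as with the S3 contents in the question, a rough version of the same check can be done in memory by looking at the leading signature bytes. This is a simplified stand-in for python-magic, not what the code above does, and sniff_excel_container is a hypothetical name. It assumes that legacy .xls files are OLE2 compound documents and that .xlsx files are ZIP archives, so it can only tell you which container you have: other OLE2 files (such as old .doc files) and ordinary ZIP files will match too.

```python
# Leading bytes of an OLE2 compound document (used by legacy .xls files)
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# Leading bytes of a ZIP local file header (used by .xlsx files)
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_excel_container(contents):
    """Guess the container format of in-memory file contents.

    Returns 'xls' for an OLE2 signature, 'xlsx' for a ZIP signature,
    and None when neither signature is present.
    """
    if contents.startswith(OLE2_SIGNATURE):
        return "xls"
    if contents.startswith(ZIP_SIGNATURE):
        return "xlsx"
    return None
```

A None result for bytes you expected to be a spreadsheet suggests the download is something else entirely (for example a CSV or HTML export saved with an .xls extension), which is worth ruling out before blaming the engine.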