Reading an Excel file is orders of magnitude slower using openpyxl compared to xlrd

I have an Excel spreadsheet that I need to import into SQL Server on a daily basis. The spreadsheet has roughly 250,000 rows and about 50 columns. I have tested both openpyxl and xlrd using nearly identical code.

Here is the code I am using (minus the debugging statements):

import xlrd
import openpyxl

def UseXlrd(file_name):
    # on_demand loads sheets lazily instead of parsing the whole workbook up front
    workbook = xlrd.open_workbook(file_name, on_demand=True)
    worksheet = workbook.sheet_by_index(0)
    # treat the first row as column headers
    first_row = []
    for col in range(worksheet.ncols):
        first_row.append(worksheet.cell_value(0, col))
    # build one dict per data row, keyed by the headers
    data = []
    for row in range(1, worksheet.nrows):
        record = {}
        for col in range(worksheet.ncols):
            if isinstance(worksheet.cell_value(row, col), str):
                record[first_row[col]] = worksheet.cell_value(row, col).strip()
            else:
                record[first_row[col]] = worksheet.cell_value(row, col)
        data.append(record)
    return data


def UseOpenpyxl(file_name):
    # read_only mode streams the worksheet instead of loading it all into memory
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    # openpyxl indexes rows and columns from 1
    first_row = []
    for col in range(1, sheet.max_column + 1):
        first_row.append(sheet.cell(row=1, column=col).value)
    data = []
    for r in range(2, sheet.max_row + 1):
        record = {}
        for col in range(sheet.max_column):
            if isinstance(sheet.cell(row=r, column=col + 1).value, str):
                record[first_row[col]] = sheet.cell(row=r, column=col + 1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r, column=col + 1).value
        data.append(record)
    return data

xlrd_results = UseXlrd('foo.xls')
openpyxl_results = UseOpenpyxl('foo.xlsx')  # openpyxl reads .xlsx, not the legacy .xls format

Passing the same Excel file with 3,500 rows gives drastically different run times. With xlrd I can read the entire file into a list of dictionaries in under 2 seconds. With openpyxl I get the following:

Reading Excel File...
Read 100 lines in 114.14509415626526 seconds
Read 200 lines in 471.43183994293213 seconds
Read 300 lines in 982.5288782119751 seconds
Read 400 lines in 1729.3348784446716 seconds
Read 500 lines in 2774.886833190918 seconds
Read 600 lines in 4384.074863195419 seconds
Read 700 lines in 6396.7723388671875 seconds
Read 800 lines in 7998.775000572205 seconds
Read 900 lines in 11018.460735321045 seconds

While I could make xlrd work in the final script, I would have to hard-code a lot of formatting to work around various issues (i.e. ints read as floats, dates read as ints, datetimes read as floats). Since I need to reuse this code for several more imports, it makes no sense to hard-code specific columns to format them properly and then maintain similar code across 4 different scripts.
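For what it's worth, xlrd does tag every cell with a ctype, so the conversions could in principle be centralized in one generic helper rather than hard-coded per column. A minimal sketch, built on xlrd's documented cell-type constants; normalize_cell is a hypothetical helper name, not part of either library:

import xlrd

def normalize_cell(cell, datemode):
    # dates arrive as floats tagged XL_CELL_DATE; convert them using the
    # workbook's datemode (xlrd.xldate_as_datetime is part of xlrd's API)
    if cell.ctype == xlrd.XL_CELL_DATE:
        return xlrd.xldate_as_datetime(cell.value, datemode)
    # plain numbers are always floats in xlrd; collapse whole values to int
    if cell.ctype == xlrd.XL_CELL_NUMBER:
        return int(cell.value) if cell.value.is_integer() else cell.value
    # strip stray whitespace from text cells, as in the code above
    if cell.ctype == xlrd.XL_CELL_TEXT:
        return cell.value.strip()
    return cell.value

Inside UseXlrd this would replace the isinstance check, e.g. record[first_row[col]] = normalize_cell(worksheet.cell(row, col), workbook.datemode).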

Any advice on how to proceed?

You can just iterate over the sheet:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    # consume the first row of the iterator as the headers
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            # data_type 's' marks string cells
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data

That should get you to large files. You may want to chunk your results if the list data becomes too large, as sketched below.
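A minimal sketch of that chunking, assuming the iter_chunks name and the 1,000-record default are my own choices:

from itertools import islice

def iter_chunks(records, size=1000):
    # yield lists of at most `size` records so the whole file never
    # has to sit in memory at once
    records = iter(records)
    while True:
        chunk = list(islice(records, size))
        if not chunk:
            break
        yield chunk

Each chunk can then be inserted into the database before the next one is read.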

Now the openpyxl version takes only about twice as long as the xlrd one:

%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop

Note that xlrd and openpyxl may interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float() to make the outputs comparable:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = float(cell.value)
        data.append(record)
    return data

Now, both versions give the same results for my test data:

>>> xlrd_results == openpyxl_results
True

This looks to me like a perfect candidate for the pandas module:

import pandas as pd
from sqlalchemy import create_engine
import pyodbc

# pyodbc
#
# assuming the following:
# username: scott
# password: tiger
# DSN: mydsn
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')

# pymssql
#
#engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')


df = pd.read_excel('foo.xls')

# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)

See the documentation for the DataFrame.to_sql() function for details.

PS It should be pretty fast and very easy to use.
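For a large daily import like the one in the question, a couple of documented to_sql() parameters may also help; the particular values below are assumptions, not recommendations:

# if_exists and chunksize are documented DataFrame.to_sql() parameters;
# "replace" drops and recreates the table on each run, and chunksize
# batches the INSERTs instead of issuing one huge statement
df.to_sql("table_name", engine, if_exists="replace", index=False, chunksize=1000)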

You call sheet.max_column or sheet.max_row multiple times. Don't do that; call each of them just once. If you call them inside a for loop, max_column or max_row is recomputed on every iteration.

I modified the code as follows for reference:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    # cache the dimensions once instead of recomputing them on every loop pass
    max_col = sheet.max_column
    max_row = sheet.max_row
    first_row = []
    for col in range(1, max_col + 1):
        first_row.append(sheet.cell(row=1, column=col).value)
    data = []
    for r in range(2, max_row + 1):
        record = {}
        for col in range(max_col):
            if isinstance(sheet.cell(row=r, column=col + 1).value, str):
                record[first_row[col]] = sheet.cell(row=r, column=col + 1).value.strip()
            else:
                record[first_row[col]] = sheet.cell(row=r, column=col + 1).value
        data.append(record)
    return data