Speed up openpyxl reading
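I'm working on a very large program that, in turn, reads its data from a very large Excel spreadsheet.
For some reason, loading this big Excel file takes a very long time, and I'd like to speed it up somehow.
I did my own research and tried read_only, but that didn't let me read any cells (at least not the way shown below). Any suggestions are very welcome.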
    import time

    import openpyxl as xl


    def LoadSheet(Filename, worksheet_nr):
        st = time.perf_counter()
        wb = xl.load_workbook(str(Filename))  # load the workbook
        Sheets = wb.sheetnames  # names of all worksheets in the workbook
        ActiveSheet = wb[Sheets[worksheet_nr]]  # select the sheet at index worksheet_nr
        print(round(time.perf_counter() - st, 3), "seconds of loading time.")
        print("Sheet is loaded and reading...")
        return ActiveSheet


    def CountRC_nr(ActiveSheet, StartR, StartC):
        R = StartR  # starting row number
        C = StartC  # starting column number
        Val = None  # value of the current cell (nothing read yet)
        while Val != "None":  # walk down the column until an empty cell
            Val = str(ActiveSheet.cell(row=R, column=C).value)
            R += 1
        NumberRows = R - StartR - 1
        Val = None
        R = StartR
        while Val != "None":  # walk along the row until an empty cell
            Val = str(ActiveSheet.cell(row=R, column=C).value)
            C += 1
        NumberColumns = C - StartC - 1
        return NumberRows, NumberColumns
Since someone asked to see how I used read_only, here is the modified code:
    import time

    import openpyxl as xl


    def LoadSheet(Filename, worksheet_nr):
        st = time.perf_counter()
        wb = xl.load_workbook(str(Filename), read_only=True)  # load the workbook in read-only mode
        Sheets = wb.sheetnames  # names of all worksheets in the workbook
        ActiveSheet = wb[Sheets[worksheet_nr]]  # select the sheet at index worksheet_nr
        print(round(time.perf_counter() - st, 3), "seconds of loading time.")
        print("Sheet is loaded and reading...")
        return ActiveSheet


    def CountRC_nr(ActiveSheet, StartR, StartC):
        R = StartR  # starting row number
        C = StartC  # starting column number
        Val = None  # value of the current cell (nothing read yet)
        while Val != "None":  # walk down the column until an empty cell
            Val = str(ActiveSheet.cell(row=R, column=C).value)
            R += 1
        NumberRows = R - StartR - 1
        Val = None
        R = StartR
        while Val != "None":  # walk along the row until an empty cell
            Val = str(ActiveSheet.cell(row=R, column=C).value)
            C += 1
        NumberColumns = C - StartC - 1
        return NumberRows, NumberColumns
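As I've already said, you're fighting the API. If the dimensions of the worksheet are known, you can simply access them; otherwise you can build your own counter, but there isn't much point in doing so.
Read-only mode is very fast for single lookups, but because every lookup involves parsing all the cells in the row up to the cell of interest, it is very inefficient for many lookups, which is what your code is doing.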
    from openpyxl import load_workbook

    wb = load_workbook(filename, read_only=True)  # filename: path to your workbook
    for ws in wb:
        print(ws.dimensions)
        for row in ws.rows:
            for cell in row:
                print(cell.value)  # do whatever you want with the cell
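If you still want the row and column counts that CountRC_nr computes, here is a minimal sketch of getting them in a single pass over the rows instead of one ws.cell() lookup per cell. The names count_block and count_block_in_sheet are made up for this illustration; the sketch assumes an openpyxl version whose iter_rows accepts values_only=True (2.6 or later) and that "empty" means the cell value is None.

```python
from itertools import takewhile


def count_block(rows):
    """Count filled rows (by first column) and filled columns (by first row).

    `rows` is any iterable of value tuples, for example what
    ws.iter_rows(..., values_only=True) yields. Counting stops at the
    first None, mirroring the loops in the question.
    """
    n_rows = 0
    n_cols = 0
    for i, row in enumerate(rows):
        if i == 0:
            # Columns: leading non-empty cells of the first row.
            n_cols = sum(1 for _ in takewhile(lambda v: v is not None, row))
        if not row or row[0] is None:
            break  # first empty cell in the starting column
        n_rows += 1
    return n_rows, n_cols


def count_block_in_sheet(ws, start_row, start_col):
    """Hypothetical wrapper: stream the sheet once, starting at the given cell."""
    rows = ws.iter_rows(min_row=start_row, min_col=start_col, values_only=True)
    return count_block(rows)
```

Because the rows are streamed once, this avoids re-parsing each row for every cell, which is the cost described above for read-only mode.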
You can use (cell.value for cell in row) to easily convert rows to NumPy arrays, but the Pandas library also reads worksheets directly into dataframes.
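As a rough sketch of both options: the helper names rows_to_array, sheet_to_array and sheet_to_dataframe are invented for this example, dtype=object is an assumption so that mixed cell types and None survive unchanged, and the Pandas call assumes openpyxl is installed as its Excel engine.

```python
import numpy as np


def rows_to_array(rows):
    """Build a 2-D object array from an iterable of rows of cell values."""
    return np.array([list(row) for row in rows], dtype=object)


def sheet_to_array(ws):
    """Convert an openpyxl worksheet to an array, one pass over ws.rows."""
    return rows_to_array((cell.value for cell in row) for row in ws.rows)


def sheet_to_dataframe(path, sheet=0):
    """Let Pandas read the worksheet directly (first row becomes the header)."""
    import pandas as pd  # imported lazily; only needed for this helper

    return pd.read_excel(path, sheet_name=sheet, engine="openpyxl")
```

If every row has the same length the result is a regular two-dimensional array; for numeric data you would probably convert it to a numeric dtype afterwards.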