Openpyxl optimization to test every cell of 2 workbooks
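I am trying to compare 2 xlsx files. If 2 cells differ, in value or in color, I want to print an error, which is why I decided to try Openpyxl rather than pandas or something else.
I managed to write a script that does the job, but I am new to Python, to this package, and to programming in general, so I would like a review of this code.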
    from openpyxl import load_workbook

    # I made 2 lists of file paths, I want to iterate on those lists to compare path1 vs path3 and path2 vs path4 etc
    FileList1 = [path1, path2, ...]
    FileList2 = [path3, path4, ...]

    # zip allows me to "open" 2 files in parallel and to iterate 2 by 2
    for a, b in zip(FileList1, FileList2):
        # Opening the 2 wb to compare (I can't use the read_only option, I don't know why...)
        wb1 = load_workbook(a, data_only=True)
        wb2 = load_workbook(b, data_only=True)
        # sheetnames comparison to check if the files have the same structure, if not, next couple of files
        if wb1.sheetnames == wb2.sheetnames:
            # sheetnames are the same, I get the ws names in a list
            WsList = wb1.sheetnames
            # for each worksheet of my list
            for ws in WsList:
                # Verification of the structure of the sheet --> if max rows and max columns are the same
                if wb1[ws].max_row == wb2[ws].max_row and wb1[ws].max_column == wb2[ws].max_column:
                    LastRow = wb1[ws].max_row + 1
                    LastColumn = wb1[ws].max_column + 1
                    # This is the part that I don't like, it's very "VBA". For each column and for each row, I check in the 2 files if the values and colors are the same
                    for y in range(1, LastColumn):
                        for x in range(1, LastRow):
                            # The .cell(row=x, column=y) is supposedly slow, but I don't know any other way
                            Value1 = wb1[ws].cell(row=x, column=y).value
                            Value2 = wb2[ws].cell(row=x, column=y).value
                            Color1 = wb1[ws].cell(row=x, column=y).fill.fgColor
                            Color2 = wb2[ws].cell(row=x, column=y).fill.fgColor
                            if Value1 != Value2:
                                print("Value Error")
                            if Color1 != Color2:
                                print("Color Error")
                else:
                    print("Structural error")
        else:
            print("Structural error")
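Comparing 2 files of 100k cells each takes 5 or 6 seconds with this script on my laptop.
I know that .xlsx is complex and that I cannot expect .csv speed, but I don't think this code is very "pythonic".
I tried to use iter_rows and iter_cols but did not get the expected results.
Can anyone give me some feedback?
Tested with Python 3.8.6 and openpyxl 3.0.5.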
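Edit after Charlie Clark's answer:
Here is the edited code with the "read_only" option enabled and another way of iterating over the rows, columns, and worksheets: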
    from openpyxl import load_workbook

    # I made 2 lists of file paths, I want to iterate on those lists to compare path1 vs path3 and path2 vs path4 etc
    FileList1 = [filepath1, filepath2, ...]
    FileList2 = [filepath3, filepath4, ...]

    # zip allows me to "open" 2 files in parallel and to iterate 2 by 2
    for a, b in zip(FileList1, FileList2):
        # Opening the 2 wb to compare, now with the read_only option enabled
        wb1 = load_workbook(a, data_only=True, read_only=True)
        wb2 = load_workbook(b, data_only=True, read_only=True)
        # sheetnames comparison to check if the files have the same structure, if not, next couple of files
        if wb1.sheetnames == wb2.sheetnames:
            WsList = wb1.sheetnames
            # for each worksheet
            for ws in WsList:
                # Verification of the structure of the sheet --> if max rows and max columns are the same
                if wb1[ws].max_row == wb2[ws].max_row and wb1[ws].max_column == wb2[ws].max_column:
                    ws1 = wb1[ws]
                    ws2 = wb2[ws]
                    for row1, row2 in zip(ws1, ws2):
                        for c1, c2 in zip(row1, row2):
                            # I had an issue with some EmptyCells that I couldn't resolve otherwise
                            if c1.value is not None:
                                Value1 = c1.value
                                Value2 = c2.value
                                Color1 = c1.fill.fgColor.rgb
                                Color2 = c2.fill.fgColor.rgb
                                if not (Value1 == Value2):
                                    print("Value error")
                                if not (Color1 == Color2):
                                    print("color error")
                else:
                    print(" ws structural error")
        else:
            print(" wb structural error")
On my files this is 2 times faster (a large part of that is probably due to the read_only option).
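It is always hard to talk about performance without knowing the files, but 6 seconds seems fine to me, assuming this includes the time to load the libraries and read the files, which in most cases easily accounts for most of the time.
Using read-only mode will probably improve things a little, although cell access works slightly differently there.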
    for row1, row2 in zip(ws1, ws2):
        for c1, c2 in zip(row1, row2):
            ...  # compare c1 and c2 here
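You should be able to zip the worksheets of the two workbooks, then zip the rows and check if not (c1.value == c2.value and c1.fill.fgColor == c2.fill.fgColor), with the value comparison first because it is quicker. If you know what you are doing, comparing the relevant StyleArray will be quicker than using the style objects, but I don't think that is the limiting factor here.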
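As a rough illustration of the zip-then-compare idea, here is a minimal sketch. The names `diff_rows`, `cell_color`, and `make_cell` are hypothetical helpers, not part of openpyxl, and the stand-in cells built with `SimpleNamespace` only mimic the `.value` and `.fill.fgColor` attributes used in the question. The fill is only examined when the values already match, so unlike the question's code a cell that differs in both respects is reported once, as a value difference.

```python
from types import SimpleNamespace


def cell_color(cell):
    """Return the fill's foreground color, or None when the cell has no fill."""
    fill = getattr(cell, "fill", None)
    return getattr(fill, "fgColor", None)


def diff_rows(rows1, rows2):
    """Yield (row, column, kind) for each pair of cells that differs.

    The value is compared first; the color is only looked at when the
    values are equal, so the cheaper check short-circuits the other one.
    """
    for r, (row1, row2) in enumerate(zip(rows1, rows2), start=1):
        for c, (c1, c2) in enumerate(zip(row1, row2), start=1):
            if c1.value != c2.value:
                yield r, c, "value"
            elif cell_color(c1) != cell_color(c2):
                yield r, c, "color"


def make_cell(value, color=None):
    """Hypothetical stand-in for a cell, used only for this demonstration."""
    fill = SimpleNamespace(fgColor=color) if color is not None else None
    return SimpleNamespace(value=value, fill=fill)


sheet_a = [[make_cell(1, "FFFF0000"), make_cell("x")],
           [make_cell(None), make_cell(2.5)]]
sheet_b = [[make_cell(1, "FF00FF00"), make_cell("y")],
           [make_cell(None), make_cell(2.5)]]

differences = list(diff_rows(sheet_a, sheet_b))
print(differences)
```

With real workbooks the two worksheets would presumably be passed in directly, as in `diff_rows(ws1, ws2)`, since iterating a worksheet yields its rows. Note that `zip` stops at the shorter input, so the structural check on row and column counts is still needed beforehand.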
If in doubt, use cProfile on your code.
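For reference, a small sketch of what profiling with cProfile can look like. To stay self-contained it profiles a hypothetical `compare_grids` function over plain nested lists rather than real workbooks; the grid size of about 100k values is only chosen to echo the question.

```python
import cProfile
import io
import pstats


def compare_grids(grid1, grid2):
    """Count the positions where two equally sized grids differ."""
    mismatches = 0
    for row1, row2 in zip(grid1, grid2):
        for v1, v2 in zip(row1, row2):
            if v1 != v2:
                mismatches += 1
    return mismatches


# Hypothetical data: 100 rows x 1000 columns, about 100k values per grid.
grid_a = [[r * c for c in range(1000)] for r in range(100)]
grid_b = [row[:] for row in grid_a]
grid_b[5][7] = -1  # one deliberate difference

# Record only the comparison itself, not the setup above.
profiler = cProfile.Profile()
profiler.enable()
mismatch_count = compare_grids(grid_a, grid_b)
profiler.disable()

# Print the five most expensive entries by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

To profile a whole script instead, including the time spent importing libraries and loading files, `python -m cProfile -s cumulative compare.py` does the same from the command line (the script name here is a placeholder).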