Openpyxl optimization to test every cell of 2 workbooks

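I am trying to compare 2 xlsx files. If 2 cells differ (in value or in color), I want to print an error, which is why I decided to try Openpyxl rather than pandas or something else.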

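I managed to put together a script that does the job, but I am new to Python, to this package, and to programming in general, and I would like a review of this code.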

from openpyxl import load_workbook 

# I made 2 lists of file paths; I want to iterate over them to compare path1 vs path3, path2 vs path4, etc.
FileList1 = [path1, path2...]                  
FileList2 = [path3, path4....]


# zip lets me "open" 2 files in parallel and iterate over them pair by pair
for a, b in zip(FileList1, FileList2):
    # Open the 2 workbooks to compare (I can't use the read_only option, I don't know why...)
    wb1 = load_workbook(a, data_only=True)
    wb2 = load_workbook(b, data_only=True)

    # Compare sheet names to check that the files have the same structure; if not, move on to the next pair of files

    if wb1.sheetnames == wb2.sheetnames :
        #sheetnames are the same, i get the ws names in a list
        WsList = wb1.sheetnames  

        #for each worksheet of my list
        for ws in WsList:  
            #Verification of the structure of the sheet --> if max rows and max columns are the same    
            if wb1[ws].max_row == wb2[ws].max_row and wb1[ws].max_column == wb2[ws].max_column:                            

                LastRow = wb1[ws].max_row + 1
                LastColumn = wb1[ws].max_column + 1
            
                #This is the part that i don't like, it's very "VBA". For each column and for each row, i check in the 2 files if the values and colors are the same
                for y in range(1,LastColumn):                                                                                    
                    for x in range (1,LastRow) :  
                        
                        #The .cell(row=x, column=y) is supposedly slow, but i don't know any other way

                        Value1  =  wb1[ws].cell(row=x, column=y).value   
                        Value2  =  wb2[ws].cell(row=x, column=y).value 
                        Color1 =  wb1[ws].cell(row=x, column=y).fill.fgColor 
                        Color2 =  wb2[ws].cell(row=x, column=y).fill.fgColor 

                        if Value1 != Value2 : 
                            print("Value Error")

                        if Color1 != Color2 :     
                            print("Color Error")          
   
            else :
                
                print("Structural error")
                                                                                 
    else :
        print("Structural error")

Comparing 2 files of 100k cells each takes this script 5 or 6 seconds on my laptop.

I know that .xlsx is complex and I can't expect .csv speed, but I don't think this code is very "pythonic".

I tried using iter_rows and iter_cols, but I didn't get the expected results.

Can anyone give me some feedback?

Tested with Python 3.8.6 and openpyxl version 3.0.5.

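Edit after Charlie Clark's answer: here is the edited code with the "read_only" option enabled, and with another way to iterate through rows, columns and worksheets: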

from openpyxl import load_workbook 

#i made 2 lists of files paths, i want to iterate on those lists to compare path1 vs path 3 and path2 v path 4 etc
FileList1 =  [filepath1, filepath2...]                
FileList2 = [filepath3, filepath4...]

#zip allows me to "open" 2 files in parallel and to iterate 2 by 2
for a, b in zip(FileList1, FileList2): 
    # Open the 2 workbooks to compare, now with the read_only option enabled
    wb1 = load_workbook(a, data_only=True, read_only=True)
    wb2 = load_workbook(b, data_only=True, read_only=True)

    # Compare sheet names to check that the files have the same structure; if not, move on to the next pair of files
    if wb1.sheetnames == wb2.sheetnames :
        WsList = wb1.sheetnames
    
        #for each worksheet 
        for ws in WsList:
           
           #Verification of the structure of the sheet --> if max rows and max columns are the same    
            if wb1[ws].max_row == wb2[ws].max_row and wb1[ws].max_column == wb2[ws].max_column: 

                ws1 = wb1[ws]
                ws2 = wb2[ws]
                
                for row1, row2 in zip(ws1, ws2):
                    for c1, c2 in zip(row1, row2): 

                        # I had an issue with some EmptyCells that I couldn't resolve any other way
                        if c1.value is not None:

                            Value1 = c1.value
                            Value2 = c2.value

                            Color1 = c1.fill.fgColor.rgb
                            Color2 = c2.fill.fgColor.rgb

                            if not (Value1 == Value2):
                                print('Value error')     
                            
                            if not (Color1 == Color2 ):
                                print("color error")     
            else :
                print(" ws structural error")                                                                    
    else :
        print(" wb structural error")                
    
  
           

It is 2 times faster on my files (a large part of that is probably due to the read_only option).

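It's always hard to talk about performance without knowing the files, but 6 seconds seems fine to me, assuming this includes the time to load the library and read the files, which easily accounts for most of the total.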

Using read-only mode might bring some improvement, although cell access works slightly differently there.

    for row1, row2 in zip(ws1, ws2):
        for c1, c2 in zip(row1, row2):

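You should zip through the worksheets of the two workbooks, then zip the rows, and check if not (c1.value == c2.value and c1.fill.fgColor == c2.fill.fgColor). The value comparison is the faster of the two, and because "and" short-circuits, the fill is only looked up when the values already match. If you know what you're doing, comparing the relevant StyleArray will be faster than going through the style objects, but I don't think that's the limiting factor here.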
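A minimal sketch of that zip-based comparison, under a few assumptions. The names cell_state, diff_rows and compare_workbooks are hypothetical helpers invented for this illustration, not part of openpyxl. The sketch also assumes that an empty cell in read-only mode may come back with no usable fill, which it treats as "no color" rather than failing. It reports positions instead of just printing "error", and uses zip_longest so that a missing row or column shows up as a difference instead of being silently dropped.

```python
from itertools import zip_longest


def cell_state(cell):
    """Return (value, fill colour) for a cell, tolerating empty placeholders."""
    if cell is None:
        return (None, None)
    # Assumption: empty read-only cells may expose fill as None.
    fill = getattr(cell, "fill", None)
    color = fill.fgColor if fill is not None else None
    return (cell.value, color)


def diff_rows(rows1, rows2):
    """Yield (row, column, kind) for every cell pair that differs.

    kind is "value" or "color"; row and column are 1-based positions
    in the iteration.
    """
    paired_rows = zip_longest(rows1, rows2, fillvalue=())
    for r, (row1, row2) in enumerate(paired_rows, start=1):
        for c, (c1, c2) in enumerate(zip_longest(row1, row2), start=1):
            value1, color1 = cell_state(c1)
            value2, color2 = cell_state(c2)
            if value1 != value2:
                yield (r, c, "value")
            if color1 != color2:
                yield (r, c, "color")


def compare_workbooks(path_a, path_b):
    """Compare two .xlsx files sheet by sheet; return a list of differences."""
    # Imported here so the helpers above stay usable without openpyxl.
    from openpyxl import load_workbook

    wb1 = load_workbook(path_a, data_only=True, read_only=True)
    wb2 = load_workbook(path_b, data_only=True, read_only=True)
    try:
        if wb1.sheetnames != wb2.sheetnames:
            return [("<workbook>", 0, 0, "structure")]
        problems = []
        for name in wb1.sheetnames:
            rows1 = wb1[name].iter_rows()
            rows2 = wb2[name].iter_rows()
            for r, c, kind in diff_rows(rows1, rows2):
                problems.append((name, r, c, kind))
        return problems
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        wb1.close()
        wb2.close()
```

Calling compare_workbooks(a, b) for each pair from zip(FileList1, FileList2) would then replace the nested loops in the question. Since diff_rows only needs objects with .value and .fill.fgColor, it can be tried out on simple stand-in objects before pointing it at real files.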

If in doubt, use cProfile on your code.
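As a rough illustration of what that could look like, here is a small sketch that uses only the standard library. The helper profile_call and the stand-in workload slow_sum are hypothetical names made up for this example; in practice you would pass in your own comparison function instead.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the 10 most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_sum(n):
    # Stand-in workload; replace with the real comparison function.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    result, report = profile_call(slow_sum, 100_000)
    print(report)
```

If most of the cumulative time turns out to sit inside load_workbook rather than in the comparison loops, the loops are probably not worth optimizing further. For a whole script, running python -m cProfile -s cumulative on it gives a similar report without any code changes.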