How to search and edit Excel files fast in Openpyxl

Question

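I have 2 worksheets. I need to compare every cell in the sheet 'Data' (350k rows, strings) against the cells in another sheet, 'Dictionary'. If the string is not in 'Dictionary', or is in the first column of 'Dictionary', do nothing. If it appears anywhere else in 'Dictionary', take the value from the first column of that row. Then go back to 'Data' and write it next to where the string originally appeared in 'Data'.

As the title says, the problem is speed. This code works on a test file of about 150 rows, but it takes 4 minutes to finish, so using it on my real file is not feasible. Please tell me how to speed it up. This is my first Python code.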

import openpyxl

wb = openpyxl.load_workbook('Test.xlsx')
first_sheet = wb.sheetnames[0]
Data = wb.get_sheet_by_name(first_sheet)
second_sheet = wb.sheetnames[1]
Dictionary = wb.get_sheet_by_name(second_sheet)

for rownum in range(2, Data.max_row + 1):
    var1 = Data.cell(row=rownum, column=1).value
    for rownum1 in range(2, Dictionary.max_row + 1):
        var2 = Dictionary.cell(row=rownum1, column=1).value
        for colnum2 in range(2, Dictionary.max_column + 1):
            var3 = Dictionary.cell(row=rownum1, column=colnum2).value
            if var1 != var2 and var1 == var3:
                Data.cell(row=rownum, column=4).value = var2
                wb.save('Test.xlsx')
            else:
                pass

Answer 1

You can solve your problem with a hash set, which lets you check whether a value exists in constant time.
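As a minimal sketch of the idea (the sample rows and the names `dictionary_rows` and `lookup` are made up for illustration, not taken from your workbook): build a Python dict once, and every later check becomes a single hash lookup instead of a scan over the whole sheet.

```python
# Hypothetical rows from 'Dictionary': the first item is the first-column
# value, the remaining items are the other cells in that row.
dictionary_rows = [
    ("apple", "appel", "aple"),
    ("banana", "bananna", None),
]

# Build the reverse lookup once: other-column value -> first-column value.
lookup = {}
for first, *others in dictionary_rows:
    for value in others:
        if value is not None:                # skip empty cells
            lookup.setdefault(value, first)  # keep the first match found

print(lookup.get("aple"))    # apple
print(lookup.get("apple"))   # None: first-column values are not keys
print("bananna" in lookup)   # True
```

Building the dict costs one pass over 'Dictionary'; after that, checking 350k strings is 350k constant-time lookups rather than 350k full scans.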

Edit: you asked for a more concrete example.

Import the library and open your file:

import openpyxl

wb = openpyxl.load_workbook('Test.xlsx')
first_sheet = wb.sheetnames[0]
Data = wb[first_sheet]  # get_sheet_by_name is deprecated in newer openpyxl
second_sheet = wb.sheetnames[1]
Dictionary = wb[second_sheet]

Read every value of Dictionary into memory and build a dict that maps each value that is not in the first column to the first-column value of its row.

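Dict = {}

for row in range(2, Dictionary.max_row + 1):
    # Value in the first column of this row
    first_value = Dictionary.cell(row=row, column=1).value
    for col in range(2, Dictionary.max_column + 1):
        cell_value = Dictionary.cell(row=row, column=col).value
        if cell_value is not None:  # skip empty cells
            Dict[cell_value] = first_value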

Now iterate over Data and use the dict to do the replacement:

for row in range(2, Data.max_row + 1):
    # Start at column 1 so the first column of Data is checked as well
    for col in range(1, Data.max_column + 1):
        cell_value = Data.cell(row=row, column=col).value
        if cell_value in Dict:  # if it was elsewhere in Dictionary
            # I'm not sure what you meant by "next to", so here it just overwrites
            # the value with the corresponding first-column value in Dictionary
            Data.cell(row=row, column=col).value = Dict[cell_value]

wb.save('Test.xlsx')  # save once at the end
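If overwriting is not what you want, here is a rough sketch of the "do nothing if it is in the first column" rule from the question, worked out on plain Python values so it can be tried without a workbook. The function name `values_to_write` and the sample data are hypothetical, and treating any first-column hit as "do nothing" is my reading of the question, not something confirmed by it.

```python
def values_to_write(data_values, lookup, first_column):
    """For each data value, return the first-column value to write next to
    it, or None when nothing should be written."""
    out = []
    for value in data_values:
        if value in first_column:
            # Already a first-column value: the question says to do nothing.
            out.append(None)
        else:
            # None when the value does not occur in Dictionary at all.
            out.append(lookup.get(value))
    return out


# Hypothetical sample data for illustration only.
lookup = {"appel": "apple", "bananna": "banana"}
first_column = {"apple", "banana"}
data_values = ["appel", "apple", "cherry", "bananna"]

print(values_to_write(data_values, lookup, first_column))
# ['apple', None, None, 'banana']
```

Each entry that is not None could then be written into a neighbouring column, for example with `Data.cell(row=row, column=col + 1).value = ...`, which leaves the original cell untouched.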

Answer 2

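Maybe a bit late, but in case someone runs into the same trouble... I had the same problem, so I converted the Excel sheet into a 2D numpy array, and searching became much faster. Here is my code, adapted to the OP's problem: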

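import numpy as np
import openpyxl

path = 'Test.xlsx'  # placeholder: path to your workbook
file = openpyxl.load_workbook(path, data_only=True)
WS_Names = file['Names']  # Worksheet containing the names
NP_Names = np.array(list(WS_Names.values))  # Transformation to numpy 2D array
WS_Dict = file['Dict']  # Worksheet containing the dictionary
NP_Dict = np.array(list(WS_Dict.values))  # Transformation to numpy 2D array

names = NP_Names.T[0]  # Take the first column containing data

for idx, name in enumerate(names):
    locations = np.column_stack(np.where(name == NP_Dict))
    for row, col in locations:
        if col != 0:  # Skip matches in the first column
            # The original assignment was garbled. Writing the first-column
            # value of the matching row next to the name is an assumption
            # based on the question.
            WS_Names.cell(row=idx + 1, column=4).value = NP_Dict[row, 0]

file.save(path)  # save once at the end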

Hope this helps :)


python

python-2.7

openpyxl