Iterate through columns in a read-only workbook in openpyxl
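I have a somewhat large .xlsx file: 19 columns, 5185 rows. I want to open the file, read all the values in one column, do some processing on those values, and then create a new column in the same workbook and write out the modified values. So I need to be able to both read and write in the same file.
My original code did this: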
    from openpyxl import load_workbook, utils

    def readExcel(doc):
        wb = load_workbook(generalpath + exppath + doc)
        ws = wb["Sheet1"]

        # iterate through the columns to find the correct one
        for col in ws.iter_cols(min_row=1, max_row=1):
            for mycell in col:
                if mycell.value == "PerceivedSound.RESP":
                    # column letter, e.g. "G" (cell.column is an integer in openpyxl 2.6+)
                    origCol = mycell.column_letter

        # get the column letter for the first empty column to output the new values
        newCol = utils.get_column_letter(ws.max_column+1)

        # iterate through the rows to get the value from the original column,
        # do something to that value, and output it in the new column
        for myrow in range(2, ws.max_row+1):
            myrow = str(myrow)
            # do some stuff to make the new value
            cleanedResp = doStuff(ws[origCol + myrow].value)
            ws[newCol + myrow] = cleanedResp

        wb.save(doc)
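However, Python threw a memory error after row 3853 because the workbook was too big. The openpyxl docs say to use read-only mode (https://openpyxl.readthedocs.io/en/latest/optimized.html) to deal with big workbooks. I'm now trying to use that; however, there seems to be no way to iterate through the columns once I add the read_only=True argument: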
    def readExcel(doc):
        wb = load_workbook(generalpath + exppath + doc, read_only=True)
        ws = wb["Sheet1"]
        for col in ws.iter_cols(min_row=1, max_row=1):
            # etc.
Python throws this error:
    AttributeError: 'ReadOnlyWorksheet' object has no attribute 'iter_cols'
If I change the final line of the snippet above to:
    for col in ws.columns:
Python throws the same kind of error:
    AttributeError: 'ReadOnlyWorksheet' object has no attribute 'columns'
Iterating over rows works fine (and is covered in the documentation I linked above):
    for col in ws.rows:
(no error)
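Another question asks about this AttributeError, but its solution is to remove read-only mode, which doesn't work for me because openpyxl won't read my entire workbook when it is not in read-only mode.
So: how do I iterate through the columns of a large workbook?
And I haven't run into this yet, but I will once I can iterate through columns: how do I both read and write the same workbook, if that workbook is large?
Thanks!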
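According to the documentation, read-only mode only supports row-based reads (column reads are not implemented). But that's not hard to work around: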
    from openpyxl import Workbook

    wb2 = Workbook(write_only=True)
    ws2 = wb2.create_sheet()

    # find what column I need
    colcounter = 0
    for row in ws.rows:
        for cell in row:
            if cell.value == "PerceivedSound.RESP":
                break
            colcounter += 1

        # cells are apparently linked to the parent workbook meta
        # this will retain only values; you'll need custom
        # row constructor if you want to retain more
        row2 = [cell.value for cell in row]
        ws2.append(row2)  # preserve the first row in the new file
        break  # stop after first row

    # start at row 2 so the header is not processed and written a second time
    for row in ws.iter_rows(min_row=2):
        row2 = [cell.value for cell in row]
        row2.append(doStuff(row2[colcounter]))
        ws2.append(row2)  # write a new row to the new wb

    wb2.save('newfile.xlsx')
    wb.close()
    wb2.close()

    # copy `newfile.xlsx` to `generalpath + exppath + doc`
    # Either using os.system, subprocess.popen, or shutil.copy2()
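You won't be able to write to the same workbook, but as shown above you can open a new workbook (in write-only mode), write to it, and then overwrite the old file using an OS-level copy.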
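To make the read, write, and overwrite steps concrete, here is a minimal sketch that wraps them in one function. The name `add_cleaned_column` and its parameters (`header`, `transform`, `new_header`) are hypothetical and not from the answer above. It also swaps the copy step for `os.replace` on a temporary file written next to the original, which is just one way to do the overwrite. It assumes the header is in the first row of the active sheet and keeps cell values only, not formatting.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def add_cleaned_column(path, header, transform, new_header="Cleaned"):
    """Stream `path` row by row, append transform(value) as a new last column,
    then replace the original file with the result (values only, no styles)."""
    wb_in = load_workbook(path, read_only=True)
    ws_in = wb_in.active
    wb_out = Workbook(write_only=True)
    ws_out = wb_out.create_sheet(ws_in.title)

    rows = ws_in.iter_rows(values_only=True)
    first = list(next(rows))
    col_index = first.index(header)  # raises ValueError if the header is missing
    ws_out.append(first + [new_header])
    for row in rows:
        values = list(row)
        values.append(transform(values[col_index]))
        ws_out.append(values)
    wb_in.close()  # release the read handle before the file is overwritten

    # Write next to the original so os.replace stays on one filesystem.
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".xlsx", dir=target_dir)
    os.close(fd)
    wb_out.save(tmp_path)
    os.replace(tmp_path, path)
```

With the question's names, a call might look like `add_cleaned_column(generalpath + exppath + doc, "PerceivedSound.RESP", doStuff)`.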
If the worksheet has only around 100,000 cells then you shouldn't have any memory problems. You should probably investigate this further.
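One possible first step in that investigation is to compare the sheet size recorded in the file with what you expect (19 x 5185 is about 98,500 cells). The sketch below is only an illustration: `sheet_size` is a hypothetical helper, and the idea that a much larger reported size might be related to the memory error is an assumption, not a diagnosis made in this thread.

```python
from openpyxl import load_workbook


def sheet_size(path, sheet_name=None):
    """Return (max_row, max_column) as reported for the sheet in read-only mode."""
    wb = load_workbook(path, read_only=True)
    try:
        ws = wb[sheet_name] if sheet_name else wb.active
        # In read-only mode these come from the file's stored dimensions
        # and may be None if the file does not record them.
        return ws.max_row, ws.max_column
    finally:
        wb.close()
```

If the numbers are far larger than the data you can see in the sheet, the file may contain cells beyond the visible data range, which would be worth checking before changing the code.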
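iter_cols() is not available in read-only mode because it requires constant and very inefficient reparsing of the underlying XML file. It is, however, relatively easy to convert rows from iter_rows() into columns using zip.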
    import types

    def _iter_cols(self, min_col=None, max_col=None, min_row=None,
                   max_row=None, values_only=False):
        # transpose the rows yielded by iter_rows() into columns
        yield from zip(*self.iter_rows(
            min_row=min_row, max_row=max_row,
            min_col=min_col, max_col=max_col, values_only=values_only))

    # attach the function as a bound method on every sheet of the workbook
    for sheet in workbook:
        sheet.iter_cols = types.MethodType(_iter_cols, sheet)
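The same zip transposition also works without patching the worksheet. The sketch below uses a hypothetical helper, `column_values`, and builds a tiny in-memory workbook with made-up values so that it runs on its own. Note that `zip(*rows)` unpacks every row before yielding the first column, so the whole sheet is held in memory as plain values; for a sheet of this size that should be manageable, but it is not streaming.

```python
import io

from openpyxl import Workbook, load_workbook


def column_values(ws, header):
    """Return the values below `header` as a tuple, or None if no column matches."""
    # zip(*rows) transposes rows into columns; every row is read first.
    for column in zip(*ws.iter_rows(values_only=True)):
        if column[0] == header:
            return column[1:]
    return None


# Build a tiny workbook in memory so the sketch runs without a file on disk.
buffer = io.BytesIO()
demo = Workbook()
demo.active.append(["id", "PerceivedSound.RESP"])
demo.active.append([1, "loud"])
demo.active.append([2, "quiet"])
demo.save(buffer)
buffer.seek(0)

wb = load_workbook(buffer, read_only=True)
responses = column_values(wb.active, "PerceivedSound.RESP")
wb.close()
print(responses)
```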