Iterating xlsx files and removing unicode python openpyxl

Question

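I am trying to convert all the Excel files on my computer to CSV files (sheet by sheet). Some of the .xlsx files are huge (over 100 MB). I still have a couple of problems:
1. My function for removing non-unicode characters is very slow
2. I'm not sure I'm using openpyxl's iteration correctly, since I'm still using a lot of memory and I'm afraid that if I actually let this thing run it will hit a memory error
Also, I'm looking for any coding help in general, since I'm still pretty new to coding.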

import csv
from formic import FileSet
from openpyxl import load_workbook
import re
from os.path import basename
import os
import string


def uclean(s): # Clean out non-unicode chars for csv.writer - SLOW
    try:
        return ''.join(char for char in s if char in string.printable).strip()
    except:
        return ''

def fclean(s): # Clean out non-filename-safe chars
    return ''.join([c for c in s if re.match(r'\w', c)])

xlsx_files = FileSet(directory='C:\\', include='**\\*.xlsx') # the whole computer's excel files (backslashes escaped: 'C:\' is a syntax error)
for filename in xlsx_files:
    wb = load_workbook(filename, use_iterators=True, read_only=True)  # This is still using > 600 MBs
    for sheet in wb.worksheets:
        i = wb.worksheets.index(sheet)
        bf = os.path.splitext(
            basename(filename))[0]
        sn = fclean(str(wb.get_sheet_names()[i]))
        f = bf + '_' + sn + '.csv'
        if not os.path.exists(f):
            with open(f, 'wb') as outf:
                out_writer = csv.writer(outf)
                for row in sheet.iter_rows():
                    out_writer.writerow([uclean(cell.value) for cell in row])

Answer 1

Using encode will be much faster:

#lines is some French text
In [80]: %timeit [s.encode('ascii', errors='ignore').strip() for s in lines]
10000 loops, best of 3: 15.3 µs per loop

In [81]: %timeit [uclean(s) for s in lines]                          
1000 loops, best of 3: 522 µs per loop
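
As a rough sketch of how the encode approach could replace uclean, assuming Python 3 (the question itself is tagged python-2.7). The name to_ascii is made up for illustration. Unlike uclean, it keeps numbers instead of turning them into empty strings, and it does not strip ASCII control characters such as \x00, which string.printable would exclude.

```python
def to_ascii(value):
    """Return value as ASCII-only text, dropping characters ASCII cannot encode."""
    if value is None:
        # Empty cells come back as None; write them as empty strings.
        return ''
    # Numbers, dates and other non-string cell values are converted with str().
    text = value if isinstance(value, str) else str(value)
    # In Python 3, encode() returns bytes, so decode back to str for csv.writer.
    return text.encode('ascii', errors='ignore').decode('ascii').strip()
```

For example, to_ascii('  café\n') gives 'caf' and to_ascii(3.5) gives '3.5'.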

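Regarding your openpyxl question, I'll have to get back to you. The only thing I can think of right now is that it might be possible to load one worksheet at a time rather than the whole workbook. Keep in mind that since wb is local to the loop, it is replaced with a new object on each iteration, so you won't be using an additional 600 MB of memory for each file.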
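
A minimal sketch of walking a workbook sheet by sheet and releasing it explicitly afterwards, assuming Python 3 and a reasonably recent openpyxl (the values_only argument of iter_rows and close() on a read-only workbook are not available in very old releases). The names csv_name and iter_sheet_rows are hypothetical; csv_name uses a single re.sub instead of the per-character re.match in fclean.

```python
import os
import re


def csv_name(workbook_path, sheet_title):
    """Build '<workbook>_<sheet>.csv', keeping only word characters of the sheet title."""
    base = os.path.splitext(os.path.basename(workbook_path))[0]
    return base + '_' + re.sub(r'\W', '', sheet_title) + '.csv'


def iter_sheet_rows(workbook_path):
    """Yield (sheet_title, row_values) for every row of every sheet in the workbook."""
    # Imported here so that csv_name can be used without openpyxl installed.
    from openpyxl import load_workbook

    wb = load_workbook(workbook_path, read_only=True)
    try:
        for ws in wb.worksheets:
            # values_only=True yields plain tuples of values instead of cell objects.
            for row in ws.iter_rows(values_only=True):
                yield ws.title, row
    finally:
        # Read-only workbooks keep the file open until they are closed.
        wb.close()
```

The title comes straight from ws.title, so the index lookup through wb.get_sheet_names() in the question is not needed.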

Answer 2

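Read-only mode really does read one cell at a time, so memory use is minimal. However, given that you want to convert all text to ASCII, I wonder whether the cause is that there is a lot of text in the Excel files. Excel employs an optimisation where it stores all strings in one big list that cells reference. If you have a lot of unique strings, then these could be the source of any memory problems, because we have to keep them in memory in order to be able to read them.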
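
One way to check whether the string list is the likely culprit is to look at how large it is before loading the file. The sketch below assumes the list lives at its conventional location inside the .xlsx zip archive, xl/sharedStrings.xml; the function name shared_strings_size is made up for illustration.

```python
import zipfile


def shared_strings_size(xlsx_path):
    """Return the uncompressed size in bytes of the shared string table, or 0 if absent."""
    # An .xlsx file is a zip archive; only its directory is read here, not the data.
    with zipfile.ZipFile(xlsx_path) as archive:
        try:
            # Assumption: the table is stored under its conventional name.
            return archive.getinfo('xl/sharedStrings.xml').file_size
        except KeyError:
            # Workbooks without any text cells may have no shared string table.
            return 0
```

A file whose shared string table runs to hundreds of megabytes would be consistent with the memory use described in the question.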

Regarding the conversion: you can use a wrapper to save as UTF-8, which removes the need for any inline encoding.
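
The answer does not say which wrapper it has in mind. As one possible reading, assuming Python 3, the built-in open() can do the encoding itself, so csv.writer receives ordinary strings and no per-cell cleaning is needed. The name write_rows_utf8 is hypothetical.

```python
import csv


def write_rows_utf8(path, rows):
    """Write an iterable of rows to path as UTF-8 encoded CSV."""
    # newline='' lets the csv module control line endings, as its documentation recommends.
    with open(path, 'w', newline='', encoding='utf-8') as outf:
        writer = csv.writer(outf)
        for row in rows:
            # Empty cells arrive as None; write them as empty fields.
            writer.writerow(['' if value is None else value for value in row])
```

With this, accented characters such as the French text in the first answer are preserved in the output rather than dropped.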

Tags: python, csv, iteration, python-2.7, openpyxl