python: converting corrupt xls file

Question

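I have downloaded a few sales data sets from an SAP application. SAP automatically converted the data to .XLS files. Whenever I open one with the Pandas library, I get the following error: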

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'

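When I open the .XLS file in MS Excel, it shows a popup saying the file is corrupt or has an unsupported extension and asks whether I want to continue. When I click 'Yes', it shows the correct data. When I save the file again as .xls from MS Excel, I can read it with Pandas.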

So I tried renaming the file with os.rename(), but that did not work. I also tried opening the file and removing \xff\xfe\r\x00\n\x00\r\x00, but it still does not work.

The workaround is to open MS Excel and manually save the file as .xls again. Is there any way to automate this? Please help.

Answer 1

In the end I converted the corrupt .xls into a proper .xls file. Here is the code:

# Changing the data types of all strings in the module at once
from __future__ import unicode_literals
# Used to save the file as excel workbook
# Need to install this library
from xlwt import Workbook
# Used to open to corrupt excel file
import io

filename = r'SALEJAN17.xls'
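# Opening the file using 'utf-16' encoding: despite its .xls extension,
# the file is tab-separated text (the '\xff\xfe' in the error is a UTF-16 BOM)
with io.open(filename, "r", encoding="utf-16") as file1:
    data = file1.readlines()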

# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
    # Two things are done here:
    # removing the '\n' left at the end of each line by readlines()
    # and splitting the line into cell values on '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)

# Saving the file as an excel file
xldoc.save('myexcel.xls')

import pandas as pd
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')

No errors.
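The '\xff\xfe' at the start of the error message is the UTF-16 little-endian byte order mark, which suggests the SAP export is tab-separated text rather than a binary workbook. As a rough sketch (assuming Python 3; the helper names sniff_format and read_tab_separated are made up for this illustration), you could check the signature first and then read the rows directly, without writing an intermediate .xls:

```python
import codecs
import csv
import io

# First 8 bytes of a genuine binary .xls (OLE2 compound file)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_format(path):
    """Return 'xls', 'utf-16-text' or 'unknown' based on the first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16-text"
    return "unknown"


def read_tab_separated(path, encoding="utf-16"):
    """Read a UTF-16, tab-separated export into a list of rows.

    Blank lines (the export in the question starts with some) are skipped.
    """
    with io.open(path, "r", encoding=encoding, newline="") as f:
        return [row for row in csv.reader(f, delimiter="\t") if row]
```

If sniff_format('SALEJAN17.xls') reports 'utf-16-text', the rows from read_tab_separated can be passed to pandas.DataFrame, or the file can be loaded with pd.read_csv(filename, sep='\t', encoding='utf-16'). Whether header rows need to be skipped depends on the export, so inspect the first few rows before relying on it.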

Answer 2

Another way to solve this is to use the win32com.client library:

import win32com.client
import os

o = win32com.client.Dispatch("Excel.Application")
o.Visible = False

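# Excel needs absolute paths; os.path.join keeps the separators consistent on Windows
filename = os.path.join(os.getcwd(), 'SALEJAN17.xls')
output = os.path.join(os.getcwd(), 'myexcel.xlsx')

wb = o.Workbooks.Open(filename)
# 51 is xlOpenXMLWorkbook, i.e. the .xlsx file format
wb.SaveAs(output, 51)
wb.Close(False)
# Quit so the hidden Excel process does not keep running in the background
o.Quit()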

In my example the file is saved in .xlsx format, but you can also save it as .xls (file format 56 instead of 51).
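Since the file format number has to match the output extension, a small helper can keep the two in sync. This is only a sketch: save_as_args is a hypothetical name, and the mapping covers just the two formats mentioned in this answer (51 is xlOpenXMLWorkbook, 56 is xlExcel8).

```python
import os

# Excel XlFileFormat values for the two extensions used in this answer
FILE_FORMATS = {".xls": 56, ".xlsx": 51}


def save_as_args(output_name):
    """Return (absolute_path, file_format) suitable for passing to SaveAs."""
    ext = os.path.splitext(output_name)[1].lower()
    if ext not in FILE_FORMATS:
        raise ValueError("unsupported extension: %r" % ext)
    return os.path.abspath(output_name), FILE_FORMATS[ext]
```

With a workbook object wb obtained as in the code above, the call would then be wb.SaveAs(*save_as_args('myexcel.xls')).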
