How to merge multiple .xls files with hyperlinks in Python?
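I am trying to merge multiple .xls files that have many columns, one of which contains hyperlinks. I tried to do this with Python but keep running into errors I cannot solve.
For brevity, the hyperlinks are hidden under the cell text. The following ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).
To improve reproducibility, I have added .xls1 and .xls2 samples below.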
.xls1:
Title | Publication_Number |
---|---|
P_A | ES2866911 (T3) |
P_B | EP3887362 (A1) |
.xls2:
Title | Publication_Number |
---|---|
P_C | AR118706 (A2) |
P_D | ES2867600 (T3) |
Desired outcome:
Title | Publication_Number |
---|---|
P_A | ES2866911 (T3) |
P_B | EP3887362 (A1) |
P_C | AR118706 (A2) |
P_D | ES2867600 (T3) |
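I cannot import the .xls files into Python without losing the formatting or the hyperlinks. I am also unable to convert the .xls files to .xlsx, and there is no way for me to obtain the files in .xlsx format. Below is a brief summary of what I have tried:
1.) Reading with pandas was my first attempt. It is easy to do, but all hyperlinks are lost in pandas, as is all formatting from the original file.
2.) Reading the .xls files with openpyxl.load
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.
3.) Converting the .xls files to .xlsx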
from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX(input.file.xls)
wb = x2x.to_xlsx()
x2x.to_xlsx('output_file.xlsx')
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
import pyexcel as p
p.save_book_as(file_name=input_file.xls, dest_file_name=export_file.xlsx)
TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration
4.) Even if we manage to read the .xls file with xlrd (which means we would never be able to save the file as .xlsx), I cannot even see the hyperlink:
import xlrd
wb = xlrd.open_workbook(file) # where vis.xls is your test file
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)' #Which is the name, not hyperlink
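5.) I tried installing an older version, openpyxl==3.0.1, to get past the TypeError, without success. I tried opening the .xls file using openpyxl with the xlrd engine and got a similar TypeError mentioning 'xml.etree.ElementTree.Element'. I tried several ways of batch-converting the .xls files to .xlsx, and all of them failed with similar errors.
Obviously I can open each file in Excel and save it as .xlsx, but that defeats the whole purpose, and I cannot do that for 100 files.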
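Without a clear reproducible example, the problem is hard to pin down. Suppose I have two files named tmp.xls and tmp2.xls containing the dummy data shown in the following two screenshots.
Then pandas can easily load them, concatenate them, and convert the result to .xlsx format without losing the hyperlinks. Here is some demo code and the resulting file: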
import pandas as pd
f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')
f3 = pd.concat([f1, f2], ignore_index=True)
f3.to_excel('./f3.xlsx')
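The demo above names two files explicitly, while the question mentions about 100 of them. One way to gather the inputs first is sketched below using only the standard library. The function name `find_xls_files` and the rule for skipping Excel lock files (names starting with `~$`) are assumptions made for illustration, not part of the original answer.

```python
from pathlib import Path


def find_xls_files(directory):
    """Return the .xls files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".xls"      # excludes .xlsx, .txt, ...
        and not p.name.startswith("~$")     # skip Excel lock files (assumption)
    )
```

The resulting list could then be passed to the same calls as in the demo, for example `pd.concat([pd.read_excel(p) for p in find_xls_files('some_dir')], ignore_index=True)`, where `some_dir` is a placeholder.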
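You need the xlrd library to read the hyperlinks properly, pandas to combine all the data, and xlsxwriter to write the data out properly.
Assuming all input files have the same format, you can use the code below.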
# imports
import os
import xlrd
import xlsxwriter
import pandas as pd

# required functions
def load_excel_to_df(filepath, hyperlink_col):
    # xlrd exposes the sheet's hyperlinks as a {(row, col): Hyperlink} mapping
    book = xlrd.open_workbook(filepath)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map
    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    # assumes every data row has a link in this column and the map is in row order
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data

# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
frames = [load_excel_to_df(os.path.join(input_data_dir, file), hyperlink_col) for file in required_files]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
combined_data = pd.concat(frames, sort=False, ignore_index=True)
cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(os.path.join(output_data_dir, output_filename), engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False)  # last column contains hyperlinks
workbook = writer.book
worksheet = writer.sheets['Sheet1']  # default sheet name used by to_excel
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)  # write the header row by hand, without pandas' header formatting
for i in range(m):
    worksheet.write_url(i+1, hyperlink_col_index, combined_data.loc[i, cols[-1]], string=combined_data.loc[i, hyperlink_col])
writer.close()  # writer.save() was removed in pandas 2.0
References:
- Reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
- pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
- Writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html
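The `load_excel_to_df` helper in this answer relies on every data row having a hyperlink and on the map being in row order. A minimal sketch of a lookup by `(row, col)` key, which tolerates rows without a link, is shown below. It uses a plain dict and a stand-in `FakeLink` tuple in place of xlrd's objects so that it runs without any .xls file; `FakeLink`, `links_for_column`, and the example URLs are all hypothetical names introduced here.

```python
from collections import namedtuple

# Stand-in for xlrd's Hyperlink objects; only the url_or_path attribute is used.
FakeLink = namedtuple("FakeLink", ["url_or_path"])


def links_for_column(hyperlink_map, col_index, n_data_rows, header_rows=1):
    """Return one URL (or None) per data row, looked up by (row, col) key."""
    links = []
    for i in range(n_data_rows):
        # Sheet rows are offset by the header rows that pandas consumed.
        link = hyperlink_map.get((i + header_rows, col_index))
        links.append(link.url_or_path if link is not None else None)
    return links


# Hypothetical map: rows 1 and 3 of column 1 carry links, row 2 does not.
example_map = {
    (1, 1): FakeLink("https://example.com/ES2866911"),
    (3, 1): FakeLink("https://example.com/AR118706"),
}
print(links_for_column(example_map, col_index=1, n_data_rows=3))
```

With a list built this way, rows whose entry is `None` would need to be written as plain text instead of being passed to `write_url`.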
I assume the Excel files are the same as in daedalus's answer. I used openpyxl instead of pandas to read the files and create a new Excel file.
import openpyxl
from openpyxl import Workbook

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']
wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']

hyperlink_dict = {}
# Go through both sheets to find the hyperlinks and keys (row 1 is the header).
for ws in (ws1, ws2):
    for row in range(2, ws.max_row + 1):
        link_cell = ws.cell(row=row, column=2)
        hyperlink_dict[ws.cell(row=row, column=1).value] = [link_cell.hyperlink.target, link_cell.value]

Now that you have all the data, you can create a new workbook and save the values from the dictionary into it with openpyxl.

wb = Workbook(write_only=True)
ws = wb.create_sheet()
for title, (target, text) in hyperlink_dict.items():
    # use ws.append() to add the data; a HYPERLINK formula keeps the link clickable
    ws.append([title, '=HYPERLINK("{}", "{}")'.format(target, text)])
wb.save('new_big_file.xlsx')
https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode
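One caveat with the dictionary used in this answer: keying on the Title column silently drops a row whenever two files share a title. The sketch below contrasts a dict with a plain list of rows, using hypothetical stand-in tuples rather than real worksheet cells; `collect_rows` and the sample URLs are illustrative names only.

```python
def collect_rows(*sheets):
    """Concatenate rows from several sheets, keeping duplicates and order."""
    merged = []
    for rows in sheets:
        merged.extend(rows)
    return merged


# Hypothetical stand-in data: (title, link text, link target) per row.
sheet1 = [("P_A", "ES2866911 (T3)", "https://example.com/1"),
          ("P_B", "EP3887362 (A1)", "https://example.com/2")]
sheet2 = [("P_A", "AR118706 (A2)", "https://example.com/3")]  # title repeats

as_dict = {title: (text, target) for title, text, target in sheet1 + sheet2}
as_list = collect_rows(sheet1, sheet2)
print(len(as_dict), len(as_list))
```

If titles are guaranteed to be unique across all files, the dictionary works as well; the list is simply the safer default for a merge.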
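Inspired by @Kunal, I managed to write code that avoids the pandas library. The .xls files are read with xlrd and written to a new Excel file with xlwt. The hyperlinks are preserved, and the output file is saved with an .xlsx extension (note that xlwt itself writes the legacy .xls format, whatever the extension):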
import os
import xlwt
from xlrd import open_workbook

# read and combine data
directory = "random_directory"
required_files = os.listdir(directory)

# Define new file and sheet to get files into
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)

# Initialize header row, can be done with any file
old_file = open_workbook(directory + "/" + required_files[0], formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)  # To create header row

# Add rows from all files present in folder
for file in required_files:
    old_file = open_workbook(directory + "/" + file, formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)  # Define old sheet
    hyperlink_map = old_sheet.hyperlink_map  # Create map of all hyperlinks
    for row in range(1, old_sheet.nrows):  # We need all rows except header row
        if row - 1 < len(hyperlink_map.items()):  # Ensure we do not run past the end of hyperlink_map.items()
            Row_depth = len(new_sheet._Worksheet__rows)  # We need row depth to know where to add new row
            for col in range(old_sheet.ncols):  # For every column we need to add row cell
                if col == 1:  # The second column (index 1) is the hyperlinked column
                    click = list(hyperlink_map.items())[row - 1][1].url_or_path  # define URL
                    new_sheet.write(Row_depth, col, xlwt.Formula('HYPERLINK("{}", "{}")'.format(click, old_sheet.cell(row, 1).value)))
                else:  # If not hyperlinked column
                    new_sheet.write(Row_depth, col, old_sheet.cell(row, col).value)  # Write cell
new_file.save("random_directory/output_file.xlsx")
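The formula string in the code above is built with plain `format`, so a double quote inside the URL or the cell text would end the string literal early. Excel formulas escape a double quote inside a string by doubling it. A minimal sketch of a helper that applies this rule follows; `hyperlink_formula` is a hypothetical name, and the example URLs are placeholders.

```python
def hyperlink_formula(url, text):
    """Build the text of an Excel HYPERLINK formula, doubling embedded quotes."""
    safe_url = str(url).replace('"', '""')
    safe_text = str(text).replace('"', '""')
    return 'HYPERLINK("{}", "{}")'.format(safe_url, safe_text)


print(hyperlink_formula("https://example.com/doc?id=1", "ES2866911 (T3)"))
print(hyperlink_formula("https://example.com/a", 'say "hi"'))
```

In the loop above, the result could be passed on as `xlwt.Formula(hyperlink_formula(click, old_sheet.cell(row, 1).value))`.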