How can I ingest an Excel spreadsheet with multiple tabs?
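I want to ingest Excel files from a remote folder or SFTP. This works for CSV files, but not for XLS or XLSX files.
The code below provides functions that convert an xls/xlsx file into Spark DataFrames.
To use these functions, you need to:
- Copy and paste the functions below into your repository (for example, in a utils.py file)
- Create a new transform script
- In the transform script, copy and paste the example transform and adjust the parameters.
Example transform using the functions: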
from transforms.api import transform, Input, Output
# Adjust this import to wherever you saved the functions (utils.py is only an example)
from utils import write_datasets_from_excel_sheets

# Parameters for ingesting an Excel file with multiple tabs
SHEETS_PARAMETERS = {
    # Each block maps one tab of the Excel file ("Artists" here) to the dataset path
    # it is written to ("/Studio/studio_datasource/artists") and to the 0-indexed
    # row ("header") that holds the column names
    "Artists": {
        "output_dataset_path": "/Studio/studio_datasource/artists",
        "header": 7
    },
    "Records": {
        "output_dataset_path": "/Studio/studio_datasource/records",
        "header": 0
    },
    "Albums": {
        "output_dataset_path": "/Studio/studio_datasource/albums",
        "header": 1
    }
}

# Define the dictionary of outputs needed by the transform's decorator
outputs = {
    sheet_parameter["output_dataset_path"]: Output(sheet_parameter["output_dataset_path"])
    for sheet_parameter in SHEETS_PARAMETERS.values()
}


@transform(
    my_input=Input("/Studio/studio_datasource/excel_file"),
    **outputs
)
def my_compute_function(my_input, ctx, **outputs):
    # Add the output objects to the parameters
    for sheetname, parameters in SHEETS_PARAMETERS.items():
        output_dataset_path = parameters["output_dataset_path"]
        parameters["output_dataset"] = outputs[output_dataset_path]
    # Transform the sheets to datasets
    write_datasets_from_excel_sheets(my_input, SHEETS_PARAMETERS, ctx)
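The way the example transform matches each tab to its output can be tried outside Foundry. The sketch below is only an illustration: `FakeOutput` is a hypothetical stand-in for `transforms.api.Output`, and `attach_outputs` is a made-up helper that mirrors the loop in `my_compute_function` but returns a copy instead of modifying `SHEETS_PARAMETERS` in place.

```python
# Standalone sketch; FakeOutput is a hypothetical stand-in for transforms.api.Output.
class FakeOutput:
    def __init__(self, path):
        self.path = path


SHEETS_PARAMETERS = {
    "Artists": {"output_dataset_path": "/Studio/studio_datasource/artists", "header": 7},
    "Records": {"output_dataset_path": "/Studio/studio_datasource/records", "header": 0},
}

# One output object per sheet, keyed by its dataset path
outputs = {
    p["output_dataset_path"]: FakeOutput(p["output_dataset_path"])
    for p in SHEETS_PARAMETERS.values()
}


def attach_outputs(sheets_parameters, **outputs):
    """Return a copy of the parameters with each sheet's output object attached."""
    enriched = {}
    for sheet_name, parameters in sheets_parameters.items():
        enriched[sheet_name] = dict(parameters)
        # Look the output up by the same path that was used as its key
        enriched[sheet_name]["output_dataset"] = outputs[parameters["output_dataset_path"]]
    return enriched


# Keyword unpacking works because the function collects them with **outputs
enriched = attach_outputs(SHEETS_PARAMETERS, **outputs)
```

Keying the outputs by dataset path means a tab needs only its own `output_dataset_path` to find the right output object, so adding a tab is a matter of adding one block to `SHEETS_PARAMETERS`.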
Functions:
import pandas as pd
import tempfile
import shutil


def normalize_column_name(cn):
    """
    Replace forbidden characters in a column name with underscores
    """
    invalid_chars = " ,;{}()\n\t="
    for c in invalid_chars:
        cn = cn.replace(c, "_")
    return cn


def get_dataframe_from_excel_sheet(fp, ctx, sheet_name, header):
    """
    Generate a Spark dataframe from a sheet in an Excel file available in Foundry
    Arguments:
        fp:
            TemporaryFile object from which the contents of the Excel file can be read
        ctx:
            Context object available in a transform
        sheet_name:
            Name of the sheet
        header:
            Row (0-indexed) to use for the column labels of the parsed DataFrame.
            If a list of integers is passed those row positions will be combined into a MultiIndex.
            Use None if there is no header.
    """
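    # No encoding is passed: Excel workbooks carry their own encoding, and
    # recent pandas versions reject an `encoding` argument to read_excel
    dataframe = pd.read_excel(
        fp,
        sheet_name=sheet_name,
        header=header
    )
    # Cast all the dataframes as string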
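
As a quick check of what `normalize_column_name` does to typical spreadsheet headers, the function can be run on its own. The column names below are made up for illustration.

```python
def normalize_column_name(cn):
    """Replace forbidden characters in a column name with underscores."""
    invalid_chars = " ,;{}()\n\t="
    for c in invalid_chars:
        cn = cn.replace(c, "_")
    return cn


# Hypothetical headers as they might appear in an Excel tab
raw_columns = ["Artist Name", "Sales (USD)", "Year=Released"]
clean_columns = [normalize_column_name(c) for c in raw_columns]
print(clean_columns)
# ['Artist_Name', 'Sales__USD_', 'Year_Released']
```

Because every forbidden character maps to the same underscore, two different headers such as "a b" and "a,b" end up with the same name, so it may be worth checking for duplicates after normalizing.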