Pandas: Write to Excel not working in Databricks
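I am trying to convert a parquet file to an Excel file. However, when I try to do so using pandas with the openpyxl or xlsxwriter engine, it fails with an "Operation not supported" error. I can, however, read an Excel file in Databricks using the openpyxl engine.
Reading with the following code works: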
import pandas as pd
xlfile = '/dbfs/mnt/raw/BOMFILE.xlsx'
tmp_csv = '/dbfs/mnt/trusted/BOMFILE.csv'
pdf = pd.read_excel(xlfile, engine='openpyxl')  # read_excel already returns a DataFrame
pdf.to_csv(tmp_csv, index=False, header=True)
However, when I try to write the same data using openpyxl or xlsxwriter, it does not work:
parq = '/mnt/raw/PRODUCT.parquet'
final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
df = spark.read.format("parquet").option("header", "true").load(parq)
pandas_df = df.toPandas()
pandas_df.to_excel(final, engine='openpyxl')
#pandas_df.to_excel(outfile, engine='xlsxwriter')#, sheet_name=tbl)
The error I get:
FileCreateError: [Errno 95] Operation not supported
OSError: [Errno 95] Operation not supported
During handling of the above exception, another exception occurred:
FileCreateError Traceback (most recent call last)
<command-473603709964454> in <module>
17 final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
18 print(outfile)
---> 19 pandas_df.to_excel(outfile, engine='openpyxl')
20 #pandas_df.to_excel(outfile, engine='xlsxwriter')#, sheet_name=tbl)
/databricks/python/lib/python3.7/site-packages/pandas/core/generic.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
2179 startcol=startcol,
2180 freeze_panes=freeze_panes,
-> 2181 engine=engine,
2182 )
2183
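Please advise.
The problem is that the local file API support in DBFS (the /dbfs FUSE mount) has limitations. For example, it does not support random writes, which are required when writing Excel files. From the documentation: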
Does not support random writes. For workloads that require random writes, perform the I/O on local disk first and then copy the result to /dbfs.
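To make "random writes" concrete: an .xlsx file is a ZIP archive, and ZIP writers typically go back to patch a member's header once its size and checksum are known. The sketch below uses Python's standard zipfile module as a stand-in for the Excel engine (an assumption; it does not call openpyxl or xlsxwriter), and SeekRecorder and count_backward_seeks are hypothetical names introduced only for illustration. It counts how often the writer seeks backward in the output, which is likely the kind of access the /dbfs mount rejects.

```python
import io
import zipfile


class SeekRecorder(io.BytesIO):
    """In-memory file that records every absolute seek to an earlier offset."""

    def __init__(self):
        super().__init__()
        self.backward_seeks = []

    def seek(self, offset, whence=io.SEEK_SET):
        position = self.tell()
        # A backward seek followed by a write is a "random write".
        if whence == io.SEEK_SET and offset < position:
            self.backward_seeks.append((position, offset))
        return super().seek(offset, whence)


def count_backward_seeks():
    """Write one ZIP member (as an .xlsx writer would) and count backward seeks."""
    buffer = SeekRecorder()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical member name, shaped like a worksheet inside an .xlsx file.
        archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    return len(buffer.backward_seeks)


if __name__ == "__main__":
    print("backward seeks while writing:", count_backward_seeks())
```

On local disk these backward seeks are harmless, which is why writing there first and copying the finished file afterwards avoids the error.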
In your case it could be:
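from shutil import copyfile
parq = '/mnt/raw/PRODUCT.parquet'
final = '/dbfs/mnt/trusted/PRODUCT.xlsx'
temp_file = '/tmp/PRODUCT.xlsx'  # local disk on the driver, which supports random writes
df = spark.read.format("parquet").option("header", "true").load(parq)
pandas_df = df.toPandas()
# write the Excel file to local disk first
pandas_df.to_excel(temp_file, engine='openpyxl')
# then copy the finished file to DBFS in one sequential pass
copyfile(temp_file, final)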
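If several files need the same treatment, the write-locally-then-copy pattern can be wrapped in a small helper. This is only a sketch: write_via_local_temp and stand_in_writer are hypothetical names, not part of pandas or Databricks, and the stand-in writer uses zipfile so the example runs without pandas installed.

```python
import os
import shutil
import tempfile
import zipfile


def write_via_local_temp(write_fn, destination, suffix=".xlsx"):
    """Call write_fn(local_path) on local disk, then copy the result to destination."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; write_fn opens the file itself
    try:
        write_fn(local_path)                      # random writes happen on local disk
        shutil.copyfile(local_path, destination)  # sequential copy to the target
    finally:
        os.remove(local_path)                     # clean up the temporary file
    return destination


def stand_in_writer(path):
    """Stand-in for an Excel writer: produces a tiny ZIP archive at path."""
    with zipfile.ZipFile(path, "w") as archive:
        archive.writestr("sheet.xml", "<worksheet/>")


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "PRODUCT.xlsx")
    print(write_via_local_temp(stand_in_writer, target))
```

In the notebook from the question, the call would presumably look like write_via_local_temp(lambda p: pandas_df.to_excel(p, engine='openpyxl'), '/dbfs/mnt/trusted/PRODUCT.xlsx').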
P.S. You can also use dbutils.fs.cp to copy the file (doc). It also works on Community Edition, where /dbfs is not supported.