Spark: what options can be passed with DataFrame.saveAsTable or DataFrameWriter.options?

Question

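Neither the developer guide nor the API documentation includes any reference to which options can be passed to DataFrame.saveAsTable or DataFrameWriter.options, or how they affect the way a Hive table is saved.

My hope is that the answers to this question can aggregate information that helps Spark developers who want more control over how Spark saves tables, and perhaps provide a basis for improving Spark's documentation.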

Answer 1

There are differences between versions.

In Spark 2 we have the following:

createOrReplaceTempView()
createTempView()
createOrReplaceGlobalTempView()
createGlobalTempView()

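DataFrame.saveAsTable was deprecated in Spark 1.4 and removed in Spark 2.0; in Spark 2 the equivalent call is df.write.saveAsTable (DataFrameWriter.saveAsTable), which is not deprecated.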

Basically, these differ in the scope in which the resulting view is available. Please refer to the link.
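As a rough sketch of the scoping difference (not taken from the answer itself), the helper below registers the same DataFrame as a session-scoped view and as a global temporary view. The function name `register_views` and the view name are hypothetical, `df` is assumed to be a `pyspark.sql.DataFrame`, and `global_temp` is assumed to be the database for global temporary views, which is the default unless `spark.sql.globalTempDatabase` is changed.

```python
def register_views(df, name):
    """Register `df` under `name` as both a temp view and a global temp view.

    `df` is assumed to be a pyspark.sql.DataFrame. Returns the names to use
    in SQL for the session-scoped view and the global view, respectively.
    """
    # Session-scoped: visible only in the SparkSession that created it.
    df.createOrReplaceTempView(name)
    # Application-scoped: shared across sessions and addressed through the
    # global_temp database (assumed default name).
    df.createOrReplaceGlobalTempView(name)
    return name, f"global_temp.{name}"
```

Neither call writes any data, which is the main contrast with `df.write.saveAsTable`, which persists the table.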

Answer 2

saveAsTable(String tableName)

Saves the content of the DataFrame as the specified table.

For reference: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameWriter.html

Answer 3

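According to the source code, you can specify the path option (the location in HDFS where the Hive external table's data is stored, which translates to 'location' in Hive DDL). I'm not sure whether there are other options associated with saveAsTable, but I'll keep searching.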
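A minimal sketch of passing the path option, assuming `df` is a `pyspark.sql.DataFrame`. The helper name `save_external_table`, the default format and mode, and any table name or location passed to it are illustrative assumptions rather than anything stated in the answer.

```python
def save_external_table(df, table_name, location, fmt="parquet", mode="overwrite"):
    """Save `df` as a table whose data files are stored at `location`."""
    (
        df.write
        .format(fmt)
        .mode(mode)
        # A custom path makes Spark keep the data at that location instead of
        # the warehouse directory; dropping the table leaves the files there.
        .option("path", location)
        .saveAsTable(table_name)
    )
```

For example, `save_external_table(df, "db.events", "hdfs:///data/events")` would register a table `db.events` backed by files under that (hypothetical) HDFS directory.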

Answer 4

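According to the latest Spark documentation, the following are the options that can be passed when writing a DataFrame to external storage with the .saveAsTable(name, format=None, mode=None, partitionBy=None, **options) API.

If you click the source hyperlink on the right-hand side of the documentation, you can navigate to the details of the less obvious parameters, e.g. format and options, which are described under the class DataFrameWriter.

So when the documentation says options – all other string options, it is referring to options, which gives you the following option as of Spark 2.4.4: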

timeZone: sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values. If it isn’t set, it uses the default value, session local timezone.

And when it says format – the format used to save, it is referring to format(source):

Specifies the underlying output data source.

Parameters:

source – string, name of the data source, e.g. ‘json’, ‘parquet’.
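The keyword form described above could be used roughly as follows. This is a sketch under assumptions: `df` is a `pyspark.sql.DataFrame`, and the helper name `save_partitioned_json`, the append mode, and the UTC default are illustrative choices. Only `timeZone` is passed through `**options`, since it is the one option the answer quotes.

```python
def save_partitioned_json(df, table_name, partition_cols, time_zone="UTC"):
    """Save `df` as a JSON-backed table using saveAsTable's keyword arguments."""
    df.write.saveAsTable(
        table_name,
        format="json",              # same effect as .format("json")
        mode="append",              # same effect as .mode("append")
        partitionBy=partition_cols, # same effect as .partitionBy(*partition_cols)
        timeZone=time_zone,         # forwarded via **options
    )
```

The chained form `df.write.format("json").mode("append").partitionBy(*cols).option("timeZone", "UTC").saveAsTable(name)` should be equivalent; which one to use is mostly a matter of style.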

Hope this helps.

Answer 5

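The reason you don't see the options documented anywhere is that they are format-specific, and developers can keep creating custom write formats with a new set of options.

For a few supported formats, however, I have listed the options mentioned in the Spark code itself.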
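To illustrate what format-specific means in practice, here is a minimal sketch that picks a different option set per format and hands it to `DataFrameWriter.options`. The mapping `EXAMPLE_OPTIONS` and the helper `write_with_format_options` are hypothetical; the option names shown (`header` and `sep` for CSV, `compression` for JSON and Parquet) are a small, non-exhaustive sample and are not the list this answer refers to.

```python
# Illustrative, non-exhaustive sample of writer options per built-in format.
EXAMPLE_OPTIONS = {
    "csv": {"header": "true", "sep": "|"},
    "json": {"compression": "gzip"},
    "parquet": {"compression": "snappy"},
}


def write_with_format_options(df, table_name, fmt, extra_options=None):
    """Save `df` as a table in format `fmt`, applying that format's options.

    Returns the options that were actually passed to the writer.
    """
    opts = dict(EXAMPLE_OPTIONS.get(fmt, {}))
    opts.update(extra_options or {})  # caller-supplied options take precedence
    df.write.format(fmt).options(**opts).saveAsTable(table_name)
    return opts
```

An option that one format understands is typically ignored by another, so it is worth checking the option class of the specific data source before relying on a key.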

Answer 6

Take a look at the class "DeltaOptions" in https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/DeltaOptions.scala

The currently supported options are:

replaceWhere
mergeSchema
overwriteSchema
maxFilesPerTrigger
excludeRegex
ignoreFileDeletion
ignoreChanges
ignoreDeletes
optimizeWrite
dataChange
queryName
checkpointLocation
path
timestampAsOf
versionAsOf
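As a sketch of how two of these options are typically passed, assuming the Delta Lake package is available on the cluster, that `df` is a `pyspark.sql.DataFrame`, and that `spark` is a `SparkSession`. The helper names, paths, and predicate are hypothetical.

```python
def overwrite_matching_rows(df, path, predicate):
    """Overwrite only the rows of the Delta table at `path` that match `predicate`."""
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)  # e.g. "dt >= '2020-01-01'"
        .save(path)
    )


def read_table_version(spark, path, version):
    """Read the Delta table at `path` as of an earlier table version."""
    return (
        spark.read
        .format("delta")
        .option("versionAsOf", version)
        .load(path)
    )
```

Write-side options such as `replaceWhere` go on `df.write`, while time-travel options such as `versionAsOf` and `timestampAsOf` go on `spark.read`.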

