A schema mismatch detected when writing to the Delta table - Azure Databricks
I am trying to load "small_radio_json.json" into a Delta Lake table; the code below creates the table.
When I try to create the Delta table I get the error "A schema mismatch detected when writing to the Delta table." It may be related to the partitioned write events.write.format("delta").mode("overwrite").partitionBy("artist").save("/delta/events/").
How can I fix or modify the code?
//https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
//https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/delta/quickstart-scala.html
//Session configuration
val appID = "123558b9-3525-4c62-8c48-d3d7e2c16a6a"
val secret = "123[xEPjpOIBJtBS-W9B9Zsv7h9IF:qw"
val tenantID = "12344839-0afa-4fae-a34a-326c42112bca"
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<appID>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
//Account Information
val storageAccountName = "mydatalake"
val fileSystemName = "fileshare1"
spark.conf.set("fs.azure.account.auth.type." + storageAccountName + ".dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type." + storageAccountName + ".dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id." + storageAccountName + ".dfs.core.windows.net", appID)
spark.conf.set("fs.azure.account.oauth2.client.secret." + storageAccountName + ".dfs.core.windows.net", secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint." + storageAccountName + ".dfs.core.windows.net", "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
//List the filesystem root (creates the filesystem on first access), then copy the sample file into it
dbutils.fs.ls("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
dbutils.fs.cp("file:///tmp/small_radio_json.json", "abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/")
//Read the JSON sample into a dataframe
val df = spark.read.json("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/small_radio_json.json")
//df.show()
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SaveMode
val events = df
display(events)
//Write the dataframe as a partitioned Delta table, then read it back
events.write.format("delta").mode("overwrite").partitionBy("artist").save("/delta/events/")
val events_delta = spark.read.format("delta").load("/delta/events/")
display(events_delta)
Exception:
org.apache.spark.sql.AnalysisException: A schema mismatch detected when writing to the Delta table.
To enable schema migration, please set:
'.option("mergeSchema", "true")'.
Table schema:
root
|-- action: string (nullable = true)
|-- date: string (nullable = true)
Data schema:
root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
Most likely the /delta/events/ directory already contains data from a previous run, and that data probably has a different schema than the current data, so when you load new data into the same directory you get this kind of exception.
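One quick way to confirm this in a Databricks notebook is to look at what is already stored at the target path before overwriting it. A minimal sketch, assuming the /delta/events/ path from the question and that the old data is disposable:
//Print the schema of whatever is already in the target directory and compare it with df
val existing = spark.read.format("delta").load("/delta/events/")
existing.printSchema()
df.printSchema()
//If the old data is not needed, remove the directory so the next write starts from an empty table
dbutils.fs.rm("/delta/events/", recurse = true)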
You are getting the schema mismatch error because the columns in your table differ from the columns in your dataframe. Based on the error snapshot you pasted in the question, your table schema has only two columns while your dataframe schema has four:
Table schema:
root
|-- action: string (nullable = true)
|-- date: string (nullable = true)
Data schema:
root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
Now you have two options (a sketch of both follows this list):
- If you want to keep the schema that is in the dataframe, add the option overwriteSchema set to true.
- If you want to keep all the columns, set the option mergeSchema to true. In that case it will merge the schemas and the table will end up with six columns: the two existing columns plus the four new ones from the dataframe.
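A minimal sketch of both options, reusing the write from the question; only the added .option(...) call differs:
//Option 1: overwriteSchema - replace the table's schema with the dataframe's schema
events.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy("artist")
  .save("/delta/events/")
//Option 2: mergeSchema - keep the existing columns and add the new ones (rows missing a column get null)
events.write.format("delta")
  .mode("overwrite")
  .option("mergeSchema", "true")
  .partitionBy("artist")
  .save("/delta/events/")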