Trouble writing data to Delta Lake in Azure Databricks (Incompatible format detected)
I need to read a dataset into a DataFrame and then write the data to Delta Lake, but I get the following exception:
AnalysisException: 'Incompatible format detected.\n\nYou are trying to write to `dbfs:/user/class@azuredatabrickstraining.onmicrosoft.com/delta/customer-data/` using Databricks Delta, but there is no\ntransaction log present. Check the upstream job to make sure that it is writing\nusing format("delta") and that you are trying to write to the table base path.\n\nTo disable this check, SET spark.databricks.delta.formatCheck.enabled=false\nTo learn more about Delta, see https://docs.azuredatabricks.net/delta/index.html\n;
This is the code that runs before the exception:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType

inputSchema = StructType([
    StructField("InvoiceNo", IntegerType(), True),
    StructField("StockCode", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("InvoiceDate", StringType(), True),
    StructField("UnitPrice", DoubleType(), True),
    StructField("CustomerID", IntegerType(), True),
    StructField("Country", StringType(), True)
])

rawDataDF = (spark.read
    .option("header", "true")
    .schema(inputSchema)
    .csv(inputPath)
)

# write to Delta Lake
rawDataDF.write.mode("overwrite").format("delta").partitionBy("Country").save(DataPath)
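This error message tells you that there is already data at the destination path (in this case dbfs:/user/class@azuredatabrickstraining.onmicrosoft.com/delta/customer-data/), and that this data is not in the Delta format (i.e. there is no transaction log). You can either choose a new path (which, based on the comments above, it seems you did) or delete that directory and try again.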
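To decide between those two options, it can help to look at what is actually sitting at the target path first. Below is a minimal sketch, assuming the DBFS location is reachable as a local directory (on Databricks clusters DBFS is typically exposed under /dbfs). `describe_target` is a hypothetical helper name, not part of any Databricks or Delta API.

```python
from pathlib import Path


def describe_target(path):
    """Classify a directory as 'missing', 'empty', 'delta' or 'non-delta'.

    A Delta table keeps its transaction log in a `_delta_log` subdirectory
    of the table base path, so its presence is used as the marker here.
    """
    target = Path(path)
    if not target.exists():
        return "missing"
    if not target.is_dir():
        # A plain file sits where the table directory should go.
        return "non-delta"
    if (target / "_delta_log").is_dir():
        return "delta"
    if not any(target.iterdir()):
        return "empty"
    # Files are present but there is no transaction log.
    return "non-delta"
```

If this returns "non-delta" for the path passed to save(), a write with format("delta") would be expected to fail with the error above. Removing the directory (for example with dbutils.fs.rm(path, True) in a notebook) or choosing a new path should let the write go through.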
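I found this question through this search: "You are trying to write to *** using Databricks Delta, but there is no transaction log present."
In case somebody searches for the same thing: for me, the solution was to explicitly code
.write.format("parquet")
because
.format("delta")
is the default in Databricks Runtime 8.0 and above, and I need "parquet" for legacy reasons.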
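A small sketch of that idea: always name the format in the write call instead of relying on the runtime default. `write_with_format` is a hypothetical wrapper, and the default of "parquet" is only an assumption matching the legacy case described above; the `format`, `mode` and `save` calls are the standard DataFrameWriter methods.

```python
def write_with_format(df, path, fmt="parquet", mode="overwrite"):
    """Write `df` to `path`, naming the format explicitly.

    `df` is expected to be a Spark DataFrame; nothing here depends on the
    cluster's default source format.
    """
    (df.write
       .format(fmt)   # explicit, so a Delta default cannot apply silently
       .mode(mode)
       .save(path))
```

With this, write_with_format(rawDataDF, DataPath) would write Parquet files, while write_with_format(rawDataDF, DataPath, fmt="delta") would write a Delta table.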