Conflicting partition column names detected Pyspark Databricks

I am trying to read a CSV file with PySpark in Databricks. marketingStartDate is in yyyyMMdd format, and lastweek = marketingStartDate - 7 days.

readFactToDataFrame('Facts',
                    'Fact.csv',
                    startDate=str(dateBefore),
                    endDate=str(marketingStartDate),
                    inferSchema=False)

I get the error message below. Do you know what the problem is?

        df = spark.read.format(fileFormat) \
                .options(header=True, inferSchema=inferSchema, delimiter=columnDelimiter) \
                .load(URL) \
                .filter("Year * 10000 + Month * 100 + Day >= " + str(startDate) +
                        " AND Year * 10000 + Month * 100 + Day <= " + str(endDate))
 37     elif fileFormat == "json":
 38         df = spark.read.format(fileFormat) \
                .options(multiline=True) \
                .load(URL) \
                .filter("Year * 10000 + Month * 100 + Day >= " + str(startDate) +
                        " AND Year * 10000 + Month * 100 + Day <= " + str(endDate))

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    164         self.options(**options)
    165         if isinstance(path, basestring):
--> 166             return self._df(self._jreader.load(path))
    167         elif path is not None:
    168             if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o2301.load.
: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
    Partition column name list #0: Year, Month, Day
    Partition column name list #1: Year, Month
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:
dbfs:/mnt/DL/Facts/Year=2020/Month=08
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=16
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=27
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=02
dbfs:/mnt/DL/Facts/Year=2020/Month=08/Day=01
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=09
dbfs:/mnt/DL/Facts/Year=2020/Month=02/Day=26
dbfs:/mnt/DL/Facts/Year=2020/Month=08/Day=10
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=12
dbfs:/mnt/DL/Facts/Year=2019/Month=10/Day=12
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=24
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=05
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=13
dbfs:/mnt/DL/Facts/Year=2019/Month=10/Day=27
dbfs:/mnt/DL/Facts/Year=2019/Month=09/Day=16
dbfs:/mnt/DL/Facts/Year=2020/Month=02/Day=20
dbfs:/mnt/DL/Facts/Year=2020/Month=03/Day=31
    at scala.Predef$.assert(Predef.scala:170)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:396)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:197)
    at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)

The error is raised because the source location dbfs:/mnt/DL/Facts contains two different partition structures:

     java.lang.AssertionError: assertion failed: Conflicting partition column names detected: 
Partition column name list #0: Year, Month, Day 
Partition column name list #1: Year, Month
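To see why Spark aborts: partition discovery derives a list of partition column names from each data file's directory path, and all files must yield the same list. Here is a minimal pure-Python sketch of that consistency check (an illustration of the idea behind the assertion, not Spark's actual PartitioningUtils code; the file names are made up):

```python
# Simplified sketch of Spark's partition-discovery consistency check.
# Not the real implementation -- just the idea behind the assertion.

def partition_columns(path):
    """Extract partition column names from a path like
    'Year=2020/Month=08/Day=01/part-0000.csv'."""
    return tuple(seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg)

def resolve_partitions(file_paths):
    """Fail, like Spark does, when files imply different column lists."""
    column_lists = {partition_columns(p) for p in file_paths}
    if len(column_lists) > 1:
        raise AssertionError(
            "Conflicting partition column names detected: "
            + " vs ".join(", ".join(cols) for cols in sorted(column_lists)))
    return column_lists.pop()

# A file sitting directly under Month=08 yields (Year, Month); files under
# Day=... leaf directories yield (Year, Month, Day). The mismatch trips
# the assertion.
files = [
    "Year=2020/Month=08/part-0000.csv",
    "Year=2020/Month=08/Day=01/part-0001.csv",
]
try:
    resolve_partitions(files)
except AssertionError as err:
    print(err)
```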

The error details point to the problematic directory:

dbfs:/mnt/DL/Facts/Year=2020/Month=08

Check that directory in Databricks to see whether any files are sitting directly in it. You can delete them or move them to a different directory. If the directory contains no files at all, you can delete the directory itself.
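The check above can be sketched in plain Python. On Databricks you would build the listing from `dbutils.fs.ls` (the directory name and listing below are assumptions for illustration):

```python
# Hedged sketch: given the entry names found in a Month=... directory,
# report anything that is not a Day=... subdirectory. Those entries are
# the data files breaking the "leaf directories only" rule and must be
# moved or deleted. On Databricks, a listing could be built with e.g.
#   listing = [f.name for f in dbutils.fs.ls("dbfs:/mnt/DL/Facts/Year=2020/Month=08")]

def stray_entries(listing, child_prefix="Day="):
    """Return entries that do not belong at this partition level."""
    return [name for name in listing if not name.startswith(child_prefix)]

# Hypothetical listing of dbfs:/mnt/DL/Facts/Year=2020/Month=08:
listing = ["Day=01/", "Day=10/", "part-00000-abc.csv"]
print(stray_entries(listing))  # the stray CSV is the file to move or delete
```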

Hope this helps to resolve the problem.