PySpark insert overwrite issue
Below are the last two lines of my PySpark ETL code:
df_writer = DataFrameWriter(usage_fact)
df_writer.partitionBy("data_date", "data_product").saveAsTable(usageWideFactTable, format=fileFormat, mode=writeMode, path=usageWideFactpath)
Here, writeMode = "append" and fileFormat = "orc".
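For reference, the same write can be expressed through the DataFrame's own .write attribute instead of constructing a DataFrameWriter by hand; a sketch using the same variables, with identical behavior:

# Equivalent form: every DataFrame exposes .write, which returns a DataFrameWriter.
usage_fact.write \
    .partitionBy("data_date", "data_product") \
    .format(fileFormat) \
    .mode(writeMode) \
    .option("path", usageWideFactpath) \
    .saveAsTable(usageWideFactTable)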
I want to replace it with an INSERT OVERWRITE so that my data is not appended when I re-run the code. So I used this:
usage_fact.createOrReplaceTempView("usage_fact")
fact = spark.sql("insert overwrite table " + usageWideFactTable + " partition (data_date, data_product) select * from usage_fact")
But this gives me the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 545, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Cannot overwrite a path that is also being read from.;'
It looks like I cannot overwrite a path that I am also reading from, but I don't know how to correct this since I am new to PySpark.
What exact code should I use to fix this issue?
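For context: the AnalysisException is raised because the target table's location is also a source in the query's lineage. One common workaround (my own suggestion, not the fix the author ultimately used below) is to materialize the data to a staging path first, so the overwrite no longer reads from the location it writes to. A minimal sketch, where s3://saasdata/tmp/usage_fact_staging/ is a hypothetical staging location:

# Hypothetical staging location; any path outside the table's own location works.
staging_path = "s3://saasdata/tmp/usage_fact_staging/"

# Persist the derived data away from the table's location...
usage_fact.write.mode("overwrite").format("orc").save(staging_path)

# ...then re-read it, which breaks the read/write cycle on the table path.
staged = spark.read.format("orc").load(staging_path)
staged.createOrReplaceTempView("usage_fact_staged")
spark.sql("insert overwrite table " + usageWideFactTable +
          " partition (data_date, data_product) select * from usage_fact_staged")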
The above code worked for me. I just made a change in the DDL and recreated the table with the following details (removing the property, if it was used):
PARTITIONED BY (
`data_date` string,
`data_product` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
'path'='s3://saasdata/datawarehouse/fact/UsageFact/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://saasdata/datawarehouse/fact/UsageFact/'
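One note on the dynamic-partition INSERT OVERWRITE used above: with a Hive table like this one, the session typically needs dynamic partitioning enabled. These settings are my assumption; the post does not show them:

# Commonly required when all partition columns (data_date, data_product)
# in an INSERT OVERWRITE are dynamic.
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")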