How do I insert data in selective columns using PySpark?

Question

I have a table on Redshift, and I want to insert some data into it using a PySpark dataframe.
The Redshift table has this schema:

CREATE TABLE admin.audit_of_all_tables
(
    wh_table_name varchar,
    wh_schema_name varchar,
    wh_population_method integer,
    wh_audit_date timestamp without time zone,
    wh_percent_change numeric(15,5),
    wh_s3_path varchar
)
DISTSTYLE AUTO;

From my dataframe, I want to keep only the values for the first four columns and write that data to this table.
My dataframe looks like this:

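Now I want to do a df.write.format on my Redshift table, but I need some way to specify that data should be inserted only into the first four columns, without passing any values for the last two columns (leaving them null by default).
Any idea how to specify this with dataframe.write.format (or any other method)?
Thanks for reading.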

Answer 1

You can use selectExpr to select the first four columns plus two additional columns of nulls that have been cast to the required types:

df2 = df.selectExpr("table_name as wh_table_name",
    "schema_name as wh_schema_name",
    "population_method as wh_population_method",
    "audit_date as wh_audit_date",
    "cast(null as double) as wh_percent_change",
    "cast(null as string) as wh_s3_path")

df2.write....
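
If the list of columns grows, the expression strings passed to selectExpr can be generated rather than typed by hand. The following is only a minimal sketch: build_select_exprs is a hypothetical helper (not part of PySpark), and the source column names and cast types are the ones assumed in the answer above.

```python
# Hypothetical helper (not a PySpark API): builds the list of SQL
# expression strings that DataFrame.selectExpr accepts.
def build_select_exprs(renames, null_columns):
    """renames: {source_column: target_column}
    null_columns: {target_column: Spark SQL type to cast the null to}
    """
    # Columns that carry data: rename them to match the target table.
    exprs = [f"{src} as {dst}" for src, dst in renames.items()]
    # Columns left empty: typed nulls, so the schema still lines up.
    exprs += [
        f"cast(null as {sql_type}) as {dst}"
        for dst, sql_type in null_columns.items()
    ]
    return exprs


# Column names and types assumed from the answer above.
exprs = build_select_exprs(
    {
        "table_name": "wh_table_name",
        "schema_name": "wh_schema_name",
        "population_method": "wh_population_method",
        "audit_date": "wh_audit_date",
    },
    {"wh_percent_change": "double", "wh_s3_path": "string"},
)

# With an existing DataFrame `df`, the result would be used as:
#     df2 = df.selectExpr(*exprs)
```

Because every column of the target table appears in df2, in table order, the write itself needs no per-column option; the last two columns simply arrive as nulls.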

python

dataframe

amazon-redshift

apache-spark-sql

pyspark