Spark SQL: null values are getting converted to empty strings in the results file

Question

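I wrote a script in AWS Glue that reads a CSV file from AWS S3, applies null checks to a few fields, and stores the result back in S3 as a new file. The problem is that when it encounters a field of type String whose value is null, the value is converted to an empty string. I don't want that conversion to happen. It works fine for all other data types.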

This is the script as written so far:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# S3 output directory
output_dir = "s3://aws-glue-scripts/..."

# Data Catalog: database and table name
db_name = "sampledb"
tbl_name = "mytable"

# Load the catalog table as a DynamicFrame
datasource = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_name)

# Convert to a Spark DataFrame and register it as a temporary view for SQL
datasource_df = datasource.toDF()
datasource_df.createOrReplaceTempView("myNewTable")

# Keep only the rows where name is null
datasource_sql_df = spark.sql("SELECT * FROM myNewTable WHERE name IS NULL")
datasource_sql_df.show()

# Convert back to a DynamicFrame and write the result to S3 as JSON
datasource_sql_dyf = DynamicFrame.fromDF(datasource_sql_df, glueContext, "datasource_sql_dyf")
glueContext.write_dynamic_frame.from_options(frame=datasource_sql_dyf,
    connection_type="s3", connection_options={"path": output_dir}, format="json")

Can anyone suggest how to solve this problem?

Thanks.

Answer 1

I don't think this is possible at the moment. Spark is configured to ignore null values when writing JSON, and the CSV reader explicitly sets null values to empty.
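To make the two behaviours mentioned above concrete, here is a minimal sketch in plain Python. It is not Spark or Glue code and does not reproduce their implementation; the helper names `read_csv_rows` and `to_json_line` and the flags `empty_as_null` and `ignore_null_fields` are hypothetical, chosen only for illustration. It shows why an empty CSV field and a null are easy to conflate (CSV has no native null), and what it looks like when a JSON writer drops null fields instead of emitting them.

```python
import csv
import io
import json


def read_csv_rows(text, empty_as_null=True):
    """Parse CSV text into dicts; optionally map empty fields to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if empty_as_null:
            # CSV cannot encode null, so an empty field is the only signal.
            row = {k: (None if v == "" else v) for k, v in row.items()}
        rows.append(row)
    return rows


def to_json_line(row, ignore_null_fields=True):
    """Serialize one row; optionally drop keys whose value is None."""
    if ignore_null_fields:
        row = {k: v for k, v in row.items() if v is not None}
    return json.dumps(row)


# Hypothetical sample data: the second record has an empty name field.
sample = "id,name\n1,alice\n2,\n"
rows = read_csv_rows(sample)

# A writer that ignores null fields omits the key entirely.
lines_dropped = [to_json_line(r) for r in rows]
# A writer that keeps them emits an explicit JSON null.
lines_kept = [to_json_line(r, ignore_null_fields=False) for r in rows]

print(lines_dropped[1])
print(lines_kept[1])
```

For comparison, Spark 3.0 and later expose an `ignoreNullFields` option on the DataFrame JSON writer. Whether it is available depends on the Spark version behind the Glue job, and I have not verified that Glue's `write_dynamic_frame` honours it, so treat that as something to test rather than a confirmed fix.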


apache-spark

apache-spark-sql

pyspark

aws-glue