Pyspark: mismatched input ... expecting EOF

Question

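I want to add a column to a DataFrame. Depending on whether a certain field is present in the source JSON, the column's value should be either the value from the source or null. My code looks like this: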

withColumn("STATUS_BIT", expr("case when 'statusBit:' in jsonDF.schema.simpleString() then statusBit else None end"))

When I run this, I get "mismatched input ''statusBit:'' expecting {<EOF>, '-'}". Am I doing something wrong with the quotes? When I try

withColumn("STATUS_BIT", expr("case when \'statusBit:\' in jsonDF.schema.simpleString() then statusBit else None end"))

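I get exactly the same error. Trying the whole thing without expr, as a plain when, raises the error "condition should be a Column". Running 'statusBit:' in jsonDF.schema.simpleString() on its own returns True for the test data I am using, but somehow I cannot integrate it into the DataFrame transformation. Thanks in advance for your help.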

Edit: applying the solution provided by PLTC helped a lot, but I am still struggling to use it inside a when clause. I tried:

.withColumn("STATUS_BIT", when(lit(df.schema.simplestring()).contains("statusBit") is True, col(statusBit)).otherwise(None))

But it tells me "condition should be a Column". So I added an extra column named SCHEMA, equal to lit(df.schema.simpleString()), and used that column in the condition:

.withColumn("STATUS_BIT", when(col("SCHEMA").contains("statusBit"), col("StatusBit")).otherwise(None)

The problem is that if I run this with test data that does not contain "statusBit", I get the error "No such struct field statusBit in ...", which is clearly not what I want to achieve.

Answer 1

jsonDF.schema.simpleString() is a Python value, so you can use it the Python way instead of inside the SQL expression string:

from pyspark.sql import functions as F

df.withColumn("STATUS_BIT", F.lit(df.schema.simpleString()).contains('statusBit:'))

Tags: case, dataframe, apache-spark-sql, pyspark