How can I find all the Null fields in a DataFrame?
I am trying to find the null fields present in a DataFrame and concatenate all of them into a new field in the same DataFrame.
The input DataFrame looks like this:
name | state | number |
---|---|---|
James | CA | 100 |
Julia | Null | Null |
Null | CA | 200 |
Expected output:
name | state | number | Null Fields |
---|---|---|---|
James | CA | 100 | |
Julia | Null | Null | state,number |
Null | CA | 200 | name |
My code looks like this, but it fails. Please help me.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James", "CA", "100"),
         ("Julia", None, None),
         (None, "CA", "200")]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("state", StringType(), True),
    StructField("number", StringType(), True)
])
df = spark.createDataFrame(data=data2, schema=schema)
cols = ["name", "state", "number"]
df.show()

def null_constraint_check(df, cols):
    df_null_identifier = df.withColumn("NULL Fields",
        [F.count(F.when(F.col(c).isNull(), c)) for c in cols])
    return df_null_identifier
df1 = null_constraint_check(df,cols)
I am getting the error:
AssertionError: col should be Column
Your approach is correct; you only need a small change in null_constraint_check: [F.count(...)] is a list of columns, while withColumn expects a single column as its second parameter. One way to get there is to concatenate all elements of the list using concat_ws (which also skips null entries, so columns that are not null simply drop out of the result):
def null_constraint_check(df, cols):
    df_null_identifier = df.withColumn("NULL Fields",
        F.concat_ws(",", *[F.when(F.col(c).isNull(), c) for c in cols]))
    return df_null_identifier
I have also removed F.count, since your question says you want the names of the null columns.
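Calling the fixed function the same way as in your question and showing the output:

df1 = null_constraint_check(df, cols)
df1.show()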
The result is:
+-----+-----+------+------------+
| name|state|number| NULL Fields|
+-----+-----+------+------------+
|James| CA| 100| |
|Julia| null| null|state,number|
| null| CA| 200| name|
+-----+-----+------+------------+
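As a side note, F.count is an aggregate function, so it does not belong inside withColumn. If you also wanted per-column null counts (rather than per-row null field names), a minimal sketch of that would be:

# Sketch: per-column null counts via an aggregation.
# F.count only counts non-null values, so counting the when(...) expression
# counts the rows in which the original column is null.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in cols]
)
null_counts.show()
# +----+-----+------+
# |name|state|number|
# +----+-----+------+
# |   1|    1|     1|
# +----+-----+------+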