How can I find all the Null fields in a DataFrame?
I am trying to find the null fields present in a DataFrame and concatenate all of them into a new field in the same DataFrame.
The input DataFrame looks like this:
name | state | number |
---|---|---|
James | CA | 100 |
Julia | Null | Null |
Null | CA | 200 |
Expected output:
name | state | number | Null Fields |
---|---|---|---|
James | CA | 100 | |
Julia | Null | Null | state,number |
Null | CA | 200 | name |
My code looks like this, but it fails. Please help me.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [("James", "CA", "100"),
         ("Julia", None, None),
         (None, "CA", "200")]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("state", StringType(), True),
    StructField("number", StringType(), True)
])
df = spark.createDataFrame(data=data2, schema=schema)
cols = ["name", "state", "number"]
df.show()

def null_constraint_check(df, cols):
    df_null_identifier = df.withColumn("NULL Fields",
        [F.count(F.when(F.col(c).isNull(), c)) for c in cols])
    return df_null_identifier
df1 = null_constraint_check(df,cols)
I am getting the error:
AssertionError: col should be Column
Your approach is correct; you only need a small change in null_constraint_check: [F.count(...)] is a list of columns, while withColumn expects a single column as its second parameter. One way to get there is to concatenate all elements of the list using concat_ws (which also skips null entries, so columns that are not null simply drop out of the result):
def null_constraint_check(df, cols):
    df_null_identifier = df.withColumn("NULL Fields",
        F.concat_ws(",", *[F.when(F.col(c).isNull(), c) for c in cols]))
    return df_null_identifier
I have also removed F.count, since your question says you want the names of the null columns.
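Calling the fixed function the same way as in your question and showing the output:

df1 = null_constraint_check(df, cols)
df1.show()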
The result is:
+-----+-----+------+------------+
| name|state|number| NULL Fields|
+-----+-----+------+------------+
|James| CA| 100| |
|Julia| null| null|state,number|
| null| CA| 200| name|
+-----+-----+------+------------+
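As a side note, F.count is an aggregate function, so it does not belong inside withColumn. If you also wanted per-column null counts (rather than per-row null field names), a minimal sketch of that would be:

# Sketch: per-column null counts via an aggregation.
# F.count only counts non-null values, so counting the when(...) expression
# counts the rows in which the original column is null.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in cols]
)
null_counts.show()
# +----+-----+------+
# |name|state|number|
# +----+-----+------+
# |   1|    1|     1|
# +----+-----+------+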