Check if value greater than zero exists in all columns of dataframe using pyspark
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).show()
This is the code I am using to get the count of NaN values. I want to write an if-else condition: if a particular column contains NaN values, print that column's name and the count of its NaN values.
You can turn that same comprehension into:
df.select([count(when(col(c) > 0, c)).alias(c) for c in df.columns]).show()
But this will cause problems when you have other dtypes, so let's go with:
from pyspark.sql.functions import col, count, when

# You could do the next two statements in one line, but this is more readable
schema = {col_name: col_type for col_name, col_type in df.dtypes}
numeric_columns = [
    col_name for col_name, col_type in schema.items()
    if col_type in "int bigint double".split()
]

df.select([count(when(col(c) > 0, c)).alias(c) for c in numeric_columns]).show()
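If you also want the if/else printing described in the question, a minimal sketch, assuming the numeric_columns list defined above, is to collect the single aggregated row and loop over it on the driver:

# Sketch: collect the one-row result and print only what you care about
counts_row = (
    df.select([count(when(col(c) > 0, c)).alias(c) for c in numeric_columns])
      .collect()[0]
      .asDict()
)
for column_name, positive_count in counts_row.items():
    if positive_count > 0:
        print(f"{column_name}: {positive_count} values greater than zero")
    else:
        print(f"{column_name}: no values greater than zero")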
If I understand you correctly, you want to do the column filtering first, before passing the columns to the list comprehension.
For example, say you have a df like the one below, where column c is NaN-free:
from pyspark.sql.functions import isnan, count, when
import numpy as np
df = spark.createDataFrame([(1.0, np.nan, 0.0), (np.nan, 2.0, 9.0),\
(np.nan, 3.0, 8.0), (np.nan, 4.0, 7.0)], ('a', 'b', 'c'))
df.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# |1.0|NaN|0.0|
# |NaN|2.0|9.0|
# |NaN|3.0|8.0|
# |NaN|4.0|7.0|
# +---+---+---+
With the solution you posted, you get
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# | 3| 1| 0|
# +---+---+---+
but what you want is
# +---+---+
# | a| b|
# +---+---+
# | 3| 1|
# +---+---+
To get that output, you can try this:
rows = df.collect()

# column filtering based on your NaN condition
nan_columns = [key for row in rows for key, val in row.asDict().items() if np.isnan(val)]
nan_columns = list(set(nan_columns))  # sort if order is important

# nan_columns
# ['a', 'b']

df.select([count(when(isnan(c), c)).alias(c) for c in nan_columns]).show()
# +---+---+
# | a| b|
# +---+---+
# | 3| 1|
# +---+---+
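If you would rather not call df.collect(), which pulls every row to the driver, one alternative sketch (not part of the original answer) is to aggregate the NaN counts once and derive both the column names and the counts from that single row, which also gives you the if-style printing the question asked for:

# Sketch: aggregate once, then keep only the columns whose NaN count is non-zero
nan_counts = (
    df.select([count(when(isnan(c), c)).alias(c) for c in df.columns])
      .collect()[0]
      .asDict()
)
nan_columns = [c for c, n in nan_counts.items() if n > 0]
# nan_columns -> ['a', 'b'] for the example df above

for c in nan_columns:
    print(f"column {c} has {nan_counts[c]} NaN values")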