Accessing a Row after a df conversion in PySpark

Question

I have the following data, obtained by converting a single-row DataFrame to an RDD. I'm using PySpark 2.1.0.

[Row((1 - (count(YEAR_MTH) / count(1)))=0.0, 
(1 - (count(REPORTED_BY) / count(1)))=0.0, 
(1 - (count(FALLS_WITHIN) / count(1)))=0.0, 
(1 - (count(LOCATION) / count(1)))=0.0, 
(1 - (count(LSOA_CODE) / count(1)))=0.021671826625387025, 
(1 - (count(LSOA_NAME) / count(1)))=0.021671826625387025, 
(1 - (count(CRIME_TYPE) / count(1)))=0.0, 
(1 - (count(CURRENT_OUTCOME) / count(1)))=0.0, 
(1 - (count(FINAL_OUTCOME) / count(1)))=0.6377708978328174)]

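I'm trying to determine the fraction of NULL values in each column by running the following aggregation on the DataFrame and converting the result to an RDD: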

from pyspark.sql import functions as fn

# count(c) skips NULLs, count('*') counts every row
col_with_nulls = df.agg(*[(1 - (fn.count(c) / fn.count('*'))) 
                    for c in cols_to_categorise]).rdd

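After this, where the fraction is small, as for LSOA_CODE (about 2%), I can safely filter out the rows that have NULLs in that column. Where it is large, as for FINAL_OUTCOME (nearly two thirds), I would impute the data instead.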

The end goal is to minimise data loss. So the question is: how do I access the columns and percentages from the "Row" listed above?

Answer 1

If you alias the columns inside agg, you can get a nice dictionary of the null percentage for each column:

null_percentage = df.agg(*[(1 - (fn.count(c) / fn.count('*'))).alias(c) 
     for c in cols_to_categorise]).first().asDict()

This gives you a dict of the form {'LSOA_CODE': 0.021671826625387025, 'CRIME_TYPE': 0.0, ...}.
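
Once you have that dict, the filter-or-impute decision from the question could be sketched roughly as below. This is only an illustration: the helper names split_by_null_share and filter_and_impute, the 0.05 threshold, and the "UNKNOWN" placeholder are assumptions of mine, not anything from the question. The only Spark calls used are DataFrame.dropna and DataFrame.fillna.

```python
def split_by_null_share(null_percentage, threshold=0.05):
    """Split column names by how large their NULL fraction is.

    threshold is an assumed cut-off; choose one that suits your data.
    """
    # Columns with a few NULLs: dropping the affected rows loses little data.
    to_filter = sorted(c for c, p in null_percentage.items() if 0 < p <= threshold)
    # Columns with many NULLs: dropping rows would lose too much, so impute.
    to_impute = sorted(c for c, p in null_percentage.items() if p > threshold)
    return to_filter, to_impute


def filter_and_impute(df, to_filter, to_impute, placeholder="UNKNOWN"):
    """Drop rows with NULLs in to_filter, fill NULLs in to_impute.

    placeholder is a hypothetical fill value; a string passed to fillna
    only affects string columns.
    """
    if to_filter:
        df = df.dropna(subset=to_filter)
    if to_impute:
        df = df.fillna(placeholder, subset=to_impute)
    return df


# Values taken from the output shown in the question.
null_percentage = {
    "LSOA_CODE": 0.021671826625387025,
    "LSOA_NAME": 0.021671826625387025,
    "CRIME_TYPE": 0.0,
    "FINAL_OUTCOME": 0.6377708978328174,
}
to_filter, to_impute = split_by_null_share(null_percentage)
print(to_filter)  # ['LSOA_CODE', 'LSOA_NAME']
print(to_impute)  # ['FINAL_OUTCOME']
```

You would then call filter_and_impute(df, to_filter, to_impute) on the original DataFrame. A constant placeholder is the crudest form of imputation, so swap in whatever strategy fits the column.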

apache-spark

data-cleaning

pyspark