Fill null values in a row with frequency of other column

Question

In a Spark Structured Streaming context, I have this DataFrame:

+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|null     |
|BR300 |1632899155|null     |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|null     |
|BR22  |1632900176|null     |
+------+----------+---------+

I want to replace the null values with the frequency of the brand within the batch, to get a DataFrame like this:

+------+----------+---------+
|brand |Timestamp |frequency|
+------+----------+---------+
|BR1   |1632899456|4        |
|BR1   |1632901256|4        |
|BR300 |1632901796|2        |
|BR300 |1632899155|2        |
|BR90  |1632901743|1        |
|BR1   |1632899933|4        |
|BR1   |1632899756|4        |
|BR22  |1632900776|2        |
|BR22  |1632900176|2        |
+------+----------+---------+

I am using Spark 2.4.3 with SQLContext, and the Scala language.

Answer 1

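Hi, I'm a Java programmer, so this is a plain-Java approach rather than a Spark one. Loop through the frequency column and find the first null and the brand it belongs to. Count how many rows carry that brand, write that count into the brand's null entries, then move on to the next brand that still has a null. Here is my Java solution (I haven't tested this code, I only wrote it in a text editor, but I hope it works, say 70%):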

    import java.util.Arrays;

    public class FillNullFrequency {
        public static void main(String[] args) {
            // The table from the question: {brand, timestamp, frequency}.
            // A null frequency marks a value that still has to be filled in.
            Object[][] table = {
                {"BR1", 1632899456L, 4},
                {"BR1", 1632901256L, 4},
                {"BR300", 1632901796L, null},
                {"BR300", 1632899155L, null},
                {"BR90", 1632901743L, 1},
                {"BR1", 1632899933L, 4},
                {"BR1", 1632899756L, 4},
                {"BR22", 1632900776L, null},
                {"BR22", 1632900176L, null}
            };
            fillNullFrequencies(table);
            for (Object[] row : table) {
                System.out.println(Arrays.toString(row));
            }
        }

        static void fillNullFrequencies(Object[][] table) {
            for (int i = 0; i < table.length; i++) {
                if (table[i][2] != null) {
                    continue; // frequency already known, nothing to fix
                }
                Object brand = table[i][0];
                // Count every row in the table that carries this brand.
                int repeatCounter = 0;
                for (Object[] row : table) {
                    if (brand.equals(row[0])) {
                        repeatCounter++;
                    }
                }
                // Replace every null frequency of this brand with the count.
                for (Object[] row : table) {
                    if (brand.equals(row[0]) && row[2] == null) {
                        row[2] = repeatCounter;
                    }
                }
            }
        }
    }

Answer 2

Use "count" over a window function:

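    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{count, when}
    // Assumes a SparkSession named `spark` is in scope, as in spark-shell;
    // with a SQLContext, import sqlContext.implicits._ instead.
    import spark.implicits._

    val df = Seq(
      ("BR1", 1632899456, Some(4)),
      ("BR1", 1632901256, Some(4)),
      ("BR300", 1632901796, None),
      ("BR300", 1632899155, None),
      ("BR90", 1632901743, Some(1)),
      ("BR1", 1632899933, Some(4)),
      ("BR1", 1632899756, Some(4)),
      ("BR22", 1632900776, None),
      ("BR22", 1632900176, None)
    ).toDF("brand", "Timestamp", "frequency")

    // Count the rows of each brand and use that count only where frequency is null.
    val brandWindow = Window.partitionBy("brand")
    val result = df.withColumn(
      "frequency",
      when($"frequency".isNotNull, $"frequency").otherwise(count($"brand").over(brandWindow))
    )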

Result:

+-----+----------+---------+
|brand|Timestamp |frequency|
+-----+----------+---------+
|BR1  |1632899456|4        |
|BR1  |1632901256|4        |
|BR1  |1632899933|4        |
|BR1  |1632899756|4        |
|BR22 |1632900776|2        |
|BR22 |1632900176|2        |
|BR300|1632901796|2        |
|BR300|1632899155|2        |
|BR90 |1632901743|1        |
+-----+----------+---------+

GroupBy solution:

    // Number of rows per brand.
    val countDF = df.select("brand").groupBy("brand").count()

    // Join the counts back and use them only where frequency is null.
    df.alias("df")
      .join(countDF.alias("cnt"), Seq("brand"))
      .withColumn("frequency", when($"df.frequency".isNotNull, $"df.frequency").otherwise($"cnt.count"))
      .select("df.brand", "df.Timestamp", "frequency")
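
Both snippets above run on a static DataFrame. Since the question is about Structured Streaming, and non-time-based windows are, as far as I know, rejected on a streaming DataFrame, one possible way to get the count "in the batch" is to apply the same window logic inside foreachBatch (available since Spark 2.4), where each micro-batch arrives as a static DataFrame. This is only a minimal sketch and has not been run against the asker's stream: the object name FillFrequencyPerBatch, the helper names, the parquet sink, and the outputPath parameter are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, count, lit}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical helper object, not part of Spark.
object FillFrequencyPerBatch {

  // Replace null frequencies with the number of rows of that brand in `batch`.
  def fillFrequency(batch: DataFrame): DataFrame = {
    val brandWindow = Window.partitionBy("brand")
    batch.withColumn(
      "frequency",
      // coalesce keeps the existing value and falls back to the per-brand row count.
      coalesce(col("frequency"), count(lit(1)).over(brandWindow))
    )
  }

  // Assumed wiring: `stream` is the streaming DataFrame from the question,
  // `outputPath` is a placeholder location for a parquet sink.
  def start(stream: DataFrame, outputPath: String): StreamingQuery = {
    // An explicitly typed function value avoids the foreachBatch overload
    // ambiguity that lambda literals can hit on Scala 2.12 builds.
    val handleBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      fillFrequency(batch).write.mode("append").parquet(outputPath)
    }
    stream.writeStream.foreachBatch(handleBatch).start()
  }
}
```

Note that the count here is per micro-batch, so the same brand can get a different frequency in different batches; that matches "frequency of the brand in the batch" as I read the question, but it is not a running total across the stream.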
