Scala: how to drop rows from a DataFrame based on a column value

Question

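I have a DataFrame containing the values below. Grouping by (id, count), I need to filter out the row with the minimum date, and for those groups the summary should change from equal to more.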

id secid count date   summary
1   2     9    20170608  equal
1   3     9    20160608  equal
2   3     8    20170608  less
3   3     9    20160608  equal

The result I need is:

id secid count date   summary
1   2     9    20170608  more
2   3     8    20170608  less
3   3     9    20160608  equal

Answer 1

You can use groupBy to group the rows by id and count, and then use when and otherwise to change the summary field to more whenever the same id and count have more than one date.

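// Imports needed outside spark-shell (assumes a SparkSession named `spark` is in scope)
import org.apache.spark.sql.functions._
import spark.implicits._

// Create your original DF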
val df = Seq((1, 2, 9, 20170608, "equal"),
      (1, 3, 9, 20160608, "equal"),
      (2, 3, 8, 20170608, "less"),
      (3, 3, 9, 20160608, "equal"),
      (1, 2, 8, 20170608, "random"),
      (1, 2, 8, 20170608, "random"))
      .toDF("id", "secid", "count", "date", "summary")

// UDF that is true when the group holds more than one date and its summary is "equal"
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary.equals("equal"))

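// Group by (id, count), keep the max date, and rewrite summary where the UDF matches.
// Note: first() takes a value from an arbitrary row of the group, so secid and summary
// are not guaranteed to come from the row that holds the max date.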
df.groupBy("id", "count")
  .agg(collect_list("date").as("datelist"),
    max("date").as("date"),
    first("secid").as("secid"),
    first("summary").as("summary"))
  .withColumn("summary",
    when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
  .drop("datelist")
  .show()

//    output
//    +---+-----+--------+-----+-------+
//    | id|count|    date|secid|summary|
//    +---+-----+--------+-----+-------+
//    |  1|    9|20170608|    2|   more|
//    |  1|    8|20170608|    2| random|
//    |  3|    9|20160608|    3|  equal|
//    |  2|    8|20170608|    3|   less|
//    +---+-----+--------+-----+-------+
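
The groupBy approach above takes secid from first(), which may not be the row with the latest date. As a hedged alternative, here is a minimal sketch that uses window functions to keep the whole latest row per (id, count) group. The object name KeepLatestPerGroup, the method keepLatest, and the helper columns rows_in_group and rank are hypothetical names chosen for illustration, and the local SparkSession settings are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, row_number, when}

// Hypothetical helper object, not part of the original answer.
object KeepLatestPerGroup {

  // Keeps the row with the latest date in each (id, count) group and
  // relabels "equal" as "more" when the group originally had several rows.
  def keepLatest(df: DataFrame): DataFrame = {
    val byGroup = Window.partitionBy("id", "count")
    val latestFirst = byGroup.orderBy(col("date").desc)

    df
      // Number of rows sharing the same (id, count)
      .withColumn("rows_in_group", count(lit(1)).over(byGroup))
      // 1 = latest date in the group; ties on date are broken arbitrarily
      .withColumn("rank", row_number().over(latestFirst))
      .filter(col("rank") === 1)
      .withColumn("summary",
        when((col("rows_in_group") > 1) && (col("summary") === "equal"), "more")
          .otherwise(col("summary")))
      .drop("rows_in_group", "rank")
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; adjust master/appName for a real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-latest-per-group")
      .getOrCreate()
    import spark.implicits._

    // The four rows from the question
    val df = Seq(
      (1, 2, 9, 20170608, "equal"),
      (1, 3, 9, 20160608, "equal"),
      (2, 3, 8, 20170608, "less"),
      (3, 3, 9, 20160608, "equal")
    ).toDF("id", "secid", "count", "date", "summary")

    keepLatest(df).orderBy("id").show()
    spark.stop()
  }
}
```

Because the filter keeps an entire row, secid and summary always come from the same record as the retained date, and the original column order (id, secid, count, date, summary) is preserved. On the question's four input rows this should produce the three rows listed as the desired result.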

scala

apache-spark-sql

spark-dataframe