Scala: how to drop rows from a DataFrame based on a column value
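I have a DataFrame with the values below. Within each (id, count) group I need to filter out the row with the minimum date, and the summary of the row that remains should change from equal to more.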
id secid count date summary
1 2 9 20170608 equal
1 3 9 20160608 equal
2 3 8 20170608 less
3 3 9 20160608 equal
I need to show:
id secid count date summary
1 2 9 20170608 more
2 3 8 20170608 less
3 3 9 20160608 equal
You can use groupBy to group the rows by id and count, then use when and otherwise to change the summary field to more whenever the same id and count have more than one date.
// Assumes an existing SparkSession named spark (as in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._

// Create the original DataFrame (the question's rows plus two extra "random" rows)
val df = Seq((1, 2, 9, 20170608, "equal"),
             (1, 3, 9, 20160608, "equal"),
             (2, 3, 8, 20170608, "less"),
             (3, 3, 9, 20160608, "equal"),
             (1, 2, 8, 20170608, "random"),
             (1, 2, 8, 20170608, "random"))
  .toDF("id", "secid", "count", "date", "summary")

// UDF: true when the group has more than one date and its summary is "equal"
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary == "equal")

// Group by (id, count), keep the latest date, and rewrite the summary
// Note: first() is not guaranteed to return values from the row that holds the max date
df.groupBy("id", "count")
  .agg(collect_list("date").as("datelist"),
       max("date").as("date"),
       first("secid").as("secid"),
       first("summary").as("summary"))
  .withColumn("summary",
    when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
  .drop("datelist")
  .show()
// output
// +---+-----+--------+-----+-------+
// | id|count|    date|secid|summary|
// +---+-----+--------+-----+-------+
// |  1|    9|20170608|    2|   more|
// |  1|    8|20170608|    2| random|
// |  3|    9|20160608|    3|  equal|
// |  2|    8|20170608|    3|   less|
// +---+-----+--------+-----+-------+
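As a possible variation on the answer above, the same grouping can be sketched without a UDF. It takes the max of a struct whose first field is date, so secid and summary come from the same row as the latest date rather than from first(). The object name LatestPerGroup, the method latestPerGroup, and the intermediate columns latest and n are hypothetical names chosen for this illustration, and the local SparkSession is an assumption. Like the original, it counts rows in the group (not distinct dates) to decide whether to write more.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, max, struct, when}

object LatestPerGroup {
  // Hypothetical helper: one row per (id, count), taken from the latest date.
  def latestPerGroup(df: DataFrame): DataFrame =
    df.groupBy("id", "count")
      .agg(
        // Structs are compared field by field, so max() selects the row with
        // the largest date (ties are broken by secid, then summary).
        max(struct(col("date"), col("secid"), col("summary"))).as("latest"),
        // Number of rows in the group, duplicates included.
        count(lit(1)).as("n"))
      .select(
        col("id"),
        col("latest.secid").as("secid"),
        col("count"),
        col("latest.date").as("date"),
        // Only groups with more than one row and summary "equal" become "more".
        when(col("n") > 1 && col("latest.summary") === "equal", "more")
          .otherwise(col("latest.summary")).as("summary"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-group")
      .getOrCreate()
    import spark.implicits._

    // The four rows from the question.
    val df = Seq((1, 2, 9, 20170608, "equal"),
                 (1, 3, 9, 20160608, "equal"),
                 (2, 3, 8, 20170608, "less"),
                 (3, 3, 9, 20160608, "equal"))
      .toDF("id", "secid", "count", "date", "summary")

    latestPerGroup(df).orderBy("id", "count").show()
    spark.stop()
  }
}
```

Using built-in column functions instead of a UDF lets Spark optimize the whole expression, and the struct keeps the selected secid and summary deterministic for a given input.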