How to compare rows and identify duplicate values in a column by grouping other columns in Scala Databricks
I want to identify the yellow rows, because they have the same date and belong to the same groupid and the same identifier1.
The green ones are correct, because they belong to different groupids and different identifier1 values.
Spark's Scala API has a dropDuplicates
function which removes duplicates based on the columns you provide. A simple example:
import org.apache.spark.sql.functions._
// In a plain Spark application (outside a Databricks notebook) you would
// also need: import spark.implicits._ for .toDF to work
val df = Seq(
  ( 1, 1, 1234, "12010", "null" ),
  ( 1, 2, 1234, "22201", "null" ),
  ( 2, 1, 2345, "12011", "null" ),
  ( 2, 2, 2345, "12011", "null" ),
  ( 2, 3, 2345, "32011", "yellow" ),
  ( 2, 4, 2345, "32011", "yellow" ),
  ( 3, 1, 3456, "4012 ", "null" ),
  ( 3, 2, 3456, "52012", "green" ),
  ( 4, 1, 4567, "52012", "green" ),
  ( 4, 2, 4567, "52013", "null" )
).toDF( "identifier1", "identifier2", "groupid", "date", "colour" )
//df.show

// Drop the duplicates based on the date and identifier1 columns
df
  .dropDuplicates(Seq("date", "identifier1"))
  .show
My result:
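Running the code should produce something like the following; row order, and which row of each duplicate pair dropDuplicates keeps, are not guaranteed:

+-----------+-----------+-------+-----+------+
|identifier1|identifier2|groupid| date|colour|
+-----------+-----------+-------+-----+------+
|          1|          1|   1234|12010|  null|
|          1|          2|   1234|22201|  null|
|          2|          1|   2345|12011|  null|
|          2|          3|   2345|32011|yellow|
|          3|          1|   3456|4012 |  null|
|          3|          2|   3456|52012| green|
|          4|          1|   4567|52012| green|
|          4|          2|   4567|52013|  null|
+-----------+-----------+-------+-----+------+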
I would say it's not 100% clear from your example exactly what is required, but hopefully this proves a useful starting point. Read more about dropDuplicates
here.
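If the goal is to flag the duplicate (yellow) rows rather than drop them, a minimal sketch, assuming a duplicate means the same groupid, identifier1 and date, is to count rows over a window on those three columns (the is_duplicate column name is just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Count how many rows share the same groupid, identifier1 and date;
// any row whose group has more than one member is flagged as a duplicate.
val w = Window.partitionBy("groupid", "identifier1", "date")

df
  .withColumn("is_duplicate", count(lit(1)).over(w) > 1)
  .show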