How to remove a specific portion of a DataFrame in Scala

I have read "convert text file to DataFrame tables using Scala Spark", and I have a follow-up question:

The table looks like this:

    +------------+----------+----+----+
    |value       |col1      |col2|col3|
    +------------+----------+----+----+
    |FIRST:      |FIRST:    |null|null|
    |erewwetrt=1 |erewwetrt |1   |null|
    |wrtertret=2 |wrtertret |2   |null|
    |ertertert=3 |ertertert |3   |null|
    |;           |;         |null|null|
    |FIRST:      |FIRST:    |null|null|
    |asdafdfd=1  |asdafdfd  |1   |null|
    |adadfadf=2  |adadfadf  |2   |null|
    |adfdafdf=3  |adfdafdf  |3   |null|
    |;           |;         |null|null|
    |SECOND:     |SECOND:   |null|null|
    |adfsfsdfgg=1|adfsfsdfgg|1   |null|
    |sdfsdfdfg=2 |sdfsdfdfg |2   |null|
    |sdfsdgsdg=3 |sdfsdgsdg |3   |null|
    |;           |;         |null|null|
    +------------+----------+----+----+

So the final DataFrame table should look like this (it only needs to contain the FIRST sections...):

    +------------+----------+----+----+
    |value       |col1      |col2|col3|
    +------------+----------+----+----+
    |FIRST:      |FIRST:    |null|null|
    |erewwetrt=1 |erewwetrt |1   |null|
    |wrtertret=2 |wrtertret |2   |null|
    |ertertert=3 |ertertert |3   |null|
    |;           |;         |null|null|
    |FIRST:      |FIRST:    |null|null|
    |asdafdfd=1  |asdafdfd  |1   |null|
    |adadfadf=2  |adadfadf  |2   |null|
    |adfdafdf=3  |adfdafdf  |3   |null|
    |;           |;         |null|null|
    ...

My question is: how do I remove the rows from SECOND: down to the ; that closes that block?

How can this be done in Scala Spark?

So here is my quick and dirty solution (see the updated solution below):

// Let's define a sample DF (shaped like your DF)
import spark.implicits._
val df = spark.sparkContext.parallelize(Array(("First", 1), ("First", 2), ("dummy", 3), ("Second", 4))).toDF
// Get the index of the row where "Second" occurs
val idx = df.rdd.zipWithIndex.filter(x => x._1(0) == "Second").map(x => x._2).first
// Keep only the rows before that index
val res = df.rdd.zipWithIndex.filter(x => x._2 < idx).map(x => x._1)
// And the result:
res.collect
// Array[org.apache.spark.sql.Row] = Array([First,1], [First,2], [dummy,3])

Oh yes, and if you want to convert it back to a DF, do this:

val df_res = spark.createDataFrame(res, df.schema)
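
If you want to sanity-check the round trip, here is a minimal check (my addition, not part of the original answer; toDF without column names yields the default _1/_2 columns):

// Quick check (my addition): show the surviving rows
df_res.show
// +-----+---+
// |   _1| _2|
// +-----+---+
// |First|  1|
// |First|  2|
// |dummy|  3|
// +-----+---+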

Updated solution: Based on the additional input, I am updating my answer as follows. (My assumption is that "Second:....." occurs only once in the file. If it does not occur at all, you should by now know how to handle that; a guarded variant is sketched after the code below.)

// A new df for illustration
val df = spark.sparkContext.parallelize(Array(("First:", 1), (";", 2), ("dummy", 3), (";", 4), ("Second:", 5), ("some value", 5), (";", 6), ("First:", 7), (";", 8))).toDF
// Zip with index
val rdd = df.rdd.zipWithIndex
// This looks like:
rdd.collect
// res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([Second:,5],4), ([some value,5],5), ([;,6],6), ([First:,7],7), ([;,8],8))
// Find the relevant index locations for "Second:" and ";"
val idx_second: Long = rdd.filter(x => x._1(0) == "Second:").map(x => x._2).first
val idx_semic: Long = rdd.filter(x => x._1(0) == ";").filter(x => x._2 >= idx_second).map(x => x._2).first
// And here is the result
val result = rdd.filter(x => (x._2 < idx_second) || (x._2 > idx_semic))
// Verify the result
result.collect
// res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([First:,7],7), ([;,8],8))
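
To make the single-occurrence assumption explicit, here is a guarded variant (my addition, not part of the original answer): .first throws on an empty RDD, so take(1) is used to detect a missing "Second:" header before anything else.

// Guarded variant (my addition): if "Second:" never occurs, keep the RDD as-is
val secondHits = rdd.filter(x => x._1(0) == "Second:").map(_._2).take(1)
val safeResult =
  if (secondHits.isEmpty) rdd // no "Second:" block, nothing to drop
  else {
    val start = secondHits(0)
    val stop = rdd.filter(x => x._1(0) == ";").filter(_._2 >= start).map(_._2).first
    rdd.filter(x => x._2 < start || x._2 > stop)
  }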

Creating the DataFrame as @datamannz described:

import org.apache.spark.sql.functions.col
import spark.implicits._

val df = spark.sparkContext.parallelize(Array(("First", 1), ("First", 2), ("dummy", 3), ("Second", 4))).toDF("value", "col1")

df.filter(col("value").notEqual("Second")).show
+-----+----+
|value|col1|
+-----+----+
|First|   1|
|First|   2|
|dummy|   3|
+-----+----+

The answer would be along these lines: file.filter(col(x).notEqual(y)).aggregate()
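
That one-liner is only a hint: file, x, and y are placeholders, and a per-row filter like the one above drops only exact matches, not the whole SECOND-to-; block. For completeness, here is a hedged pure-DataFrame sketch of dropping the block (my own construction, not from the answers above; it assumes the question's df with its value column, that monotonically_increasing_id preserves the file's row order, and that the data is small enough for an unpartitioned window):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Tag each row with the most recent section header seen so far,
// then drop every row tagged "SECOND:". Note: an unpartitioned
// window pulls all rows into one partition, so small data only.
val withId = df.withColumn("id", monotonically_increasing_id())
val w = Window.orderBy("id")
val tagged = withId.withColumn("section",
  last(when(col("value").isin("FIRST:", "SECOND:"), col("value")), ignoreNulls = true).over(w))
// coalesce keeps any rows that might appear before the first header
val cleaned = tagged
  .filter(coalesce(col("section"), lit("")) =!= "SECOND:")
  .drop("id", "section")
cleaned.show(false)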