How to remove a specific portion of a DataFrame in Scala
I have read about converting a text file into a DataFrame with Scala Spark, but I still have this question:
The table looks like this:
+------------+----------+----+----+
|value |col1 |col2|col3|
+------------+----------+----+----+
|FIRST: |FIRST: |null|null|
|erewwetrt=1 |erewwetrt |1 |null|
|wrtertret=2 |wrtertret |2 |null|
|ertertert=3 |ertertert |3 |null|
|; |; |null|null|
|FIRST: |FIRST: |null|null|
|asdafdfd=1 |asdafdfd |1 |null|
|adadfadf=2 |adadfadf |2 |null|
|adfdafdf=3 |adfdafdf |3 |null|
|; |; |null|null|
|SECOND: |SECOND: |null|null|
|adfsfsdfgg=1|adfsfsdfgg|1 |null|
|sdfsdfdfg=2 |sdfsdfdfg |2 |null|
|sdfsdgsdg=3 |sdfsdgsdg |3 |null|
|; |; |null|null|
So the final DataFrame should look like this (it should only contain the FIRST sections...):
+------------+----------+----+----+
|value |col1 |col2|col3|
+------------+----------+----+----+
|FIRST: |FIRST: |null|null|
|erewwetrt=1 |erewwetrt |1 |null|
|wrtertret=2 |wrtertret |2 |null|
|ertertert=3 |ertertert |3 |null|
|; |; |null|null|
|FIRST: |FIRST: |null|null|
|asdafdfd=1 |asdafdfd |1 |null|
|adadfadf=2 |adadfadf |2 |null|
|adfdafdf=3 |adfdafdf |3 |null|
|; |; |null|null|
...
My question is: how do I remove the rows from SECOND: through the following ; ?
How can this be done in Scala Spark?
So here is my quick-and-dirty solution (see the updated solution below):
//Let's define a sample DF (just like your DF); in spark-shell the implicits needed for .toDF are already in scope
val df = spark.sparkContext.parallelize(Array(("First",1),("First",2),("dummy",3),("Second",4))).toDF
//Get the index of the first row where "Second" occurs
val idx = df.rdd.zipWithIndex.filter(x => x._1(0) == "Second").map(x => x._2).first
//keep only the rows before that index
val res = df.rdd.zipWithIndex.filter(x => x._2 < idx).map(x => x._1)
//and the result:
res.collect
//Array[org.apache.spark.sql.Row] = Array([First,1], [First,2], [dummy,3])
Oh yes, and if you want to convert it back to a DataFrame:
val df_res = spark.createDataFrame(res, df.schema)
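To see the index logic in isolation, here is a minimal pure-Scala sketch of the same idea on a plain collection (no Spark required); the sample data and the "Second" marker are assumptions mirroring the Spark example above:

```scala
// Minimal sketch of the "cut at the first marker" logic on a plain collection.
// The data and the "Second" marker mirror the Spark example above.
val rows = List(("First", 1), ("First", 2), ("dummy", 3), ("Second", 4))

// Pair each row with its index, just like RDD.zipWithIndex.
val indexed = rows.zipWithIndex

// Index of the first row whose key is "Second"; fall back to keeping everything.
val idx = indexed.find { case ((key, _), _) => key == "Second" }.map(_._2).getOrElse(rows.length)

// Keep only the rows before that index.
val res = indexed.collect { case (row, i) if i < idx => row }
// res: List[(String, Int)] = List((First,1), (First,2), (dummy,3))
```

The Spark version does exactly this, only with `RDD.zipWithIndex` providing the indices across partitions.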
Updated solution:
Based on the additional input, I have updated my answer as follows:
(My assumption is that "Second:....." occurs only once in the file. If that is not the case, you should by now know how to handle it.)
//new df for illustration
val df = spark.sparkContext.parallelize(Array(("First:",1),(";",2),("dummy",3),(";",4),("Second:",5),("some value",5),(";",6),("First:",7),(";",8))).toDF
//zip with index
val rdd = df.rdd.zipWithIndex
//this looks like:
rdd.collect
//res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([Second:,5],4), ([some value,5],5), ([;,6],6), ([First:,7],7), ([;,8],8))
// find the relevant index locations for "Second:" and the ";" that closes its block
val idx_second: Long = rdd.filter(x => x._1(0) == "Second:").map(x => x._2).first
val idx_semic: Long = rdd.filter(x => x._1(0) == ";").filter(x => x._2 >= idx_second).map(x => x._2).first
// and here is the result (the filter preserves the original indices)
val result = rdd.filter(x => (x._2 < idx_second) || (x._2 > idx_semic))
// verify the result
result.collect
// res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([First:,7],7), ([;,8],8))
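The single-occurrence assumption can be lifted. As a hedged sketch of one way to do it (shown on a plain collection; in Spark the same scan would run over the `zipWithIndex` output), here is a fold that drops every block from a "Second:" marker up to and including the next ";". The sample rows are assumptions mirroring the example above:

```scala
// Sketch: drop every block from a "Second:" marker through the next ";" (inclusive),
// even when "Second:" occurs more than once.
val rows = List("First:", "a=1", ";", "Second:", "x=1", ";", "First:", "b=2", ";",
                "Second:", "y=9", ";")

// Fold left, carrying an "inside a Second block" flag.
val (kept, _) = rows.foldLeft((List.empty[String], false)) {
  case ((acc, true), ";")        => (acc, false)          // closing ";": drop it, leave the block
  case ((acc, true), _)          => (acc, true)           // inside a Second block: drop
  case ((acc, false), "Second:") => (acc, true)           // marker found: start dropping
  case ((acc, false), row)       => (row :: acc, false)   // normal row: keep
}
val result = kept.reverse
// result: List(First:, a=1, ;, First:, b=2, ;)
```

Since the accumulator is built by prepending, the final `reverse` restores the original row order.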
Create the DataFrame as described by @datamannz:
import org.apache.spark.sql.functions.col

val df = spark.sparkContext.parallelize(Array(("First",1),("First",2),("dummy",3),("Second",4))).toDF("value", "col1")
df.filter(col("value").notEqual("Second")).show
+-----+----+
|value|col1|
+-----+----+
|First| 1|
|First| 2|
|dummy| 3|
+-----+----+
The answer is as follows,
file.filter(col(x).notEqual(y)).aggregate()