How to get a whole row's size in a DataFrame using Scala
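My DataFrame has multiple columns. I need to add a new column holding the size of the whole row, which means adding the sizes of all the columns together. Is there a simple and efficient way to do this? Thanks.
Here is an example: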
val DataFrame = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
display(DataFrame)
I want to add a column to df that sums the length of every column. This example has only two columns, but my real df has a hundred.
val df = Seq(("Alice", "He is girl"),
("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
scala> df.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|       null|
+-----+-----------+
Replace the null values with empty strings:
val dfNoNull = df.na.fill("")
scala> dfNoNull.show
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|           |
+-----+-----------+
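Build a list of column expressions by applying the length function to each column:
import org.apache.spark.sql.functions.{col, length}
val cols = dfNoNull.columns.map(x => length(col(x)))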
Select the data using these column expressions:
val dfColCounts = dfNoNull.select(cols:_*)
scala> dfColCounts.show
+------------+--------------+
|length(name)|length(string)|
+------------+--------------+
|           5|            10|
|           3|            11|
|           3|             0|
+------------+--------------+
Get references to these new columns:
val countCols = dfColCounts.columns.map(x => col(x))
Apply reduce to sum the column values, which are now all integers:
val dfPerRowCounts = dfColCounts
  .withColumn("countPerRow", countCols.reduce(_ + _))
  .select("countPerRow")
Result:
scala> dfPerRowCounts.show
+-----------+
|countPerRow|
+-----------+
|         15|
|         14|
|          3|
+-----------+
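Since the question asks for the total as an extra column on the original df, the steps above could probably be folded into a single expression. The following is only a sketch under a few assumptions: the local SparkSession is created purely for illustration (in spark-shell or a notebook, spark already exists), nulls are counted as length 0 via coalesce instead of na.fill, and column names are assumed not to contain dots or other characters that col would interpret specially.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, length, lit}

// Local session for illustration only; skip this in spark-shell or a notebook.
val spark = SparkSession.builder().master("local[1]").appName("row-length").getOrCreate()
import spark.implicits._

val df = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null))
  .toDF("name", "string")

// One length expression per column; a null value contributes 0 to the total.
val lengths = df.columns.map(c => coalesce(length(col(c)), lit(0)))

// Sum the expressions and append the total, keeping the original columns.
val withRowLength = df.withColumn("countPerRow", lengths.reduce(_ + _))

withRowLength.show()
```

For the sample data this should give the same totals as the result above (15, 14 and 3), while name and string stay in the DataFrame. Using coalesce also covers nulls in non-string columns, which na.fill("") leaves untouched.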