Spark - Replace first occurrence in a string
I want to use the replaceFirst() function in Spark Scala SQL.
Or, is it possible to use the replaceFirst() function on a Spark Scala DataFrame?
Can this be done without a UDF?
What I want to do is:
println("abcdefgbchijkl".replaceFirst("bc","**BC**"))
// a**BC**defgbchijkl
However, this method cannot be called on a DataFrame Column:
var test0 = Seq("abcdefgbchijkl").toDF("col0")
test0
.select(col("col0").replaceFirst("bc","**BC**"))
.show(false)
/*
<console>:230: error: value replaceFirst is not a member of org.apache.spark.sql.Column
.select(col("col0").replaceFirst("bc","**BC**"))
*/
Also, I don't know how to use it in SQL form:
%sql
-- How to use replaceFirst()
select replaceFirst()
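As far as I can see, Spark does not support replacing only the first occurrence out of the box, but it can be done by combining a few functions: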
Spark >= 3.0.0
import org.apache.spark.sql.functions.{array_join, col, split}
val test0 = Seq("abcdefgbchijkl").toDF("col0") // replaced `var` with `val`
val stringToReplace = "bc" // note: `split` interprets this as a Java regular expression
val replacement = "**BC**"
test0
// create a temporary column, splitting the string by the first occurrence of `bc`
.withColumn("temp", split(col("col0"), stringToReplace, 2))
// recombine the strings before and after `bc` with the desired replacement
.withColumn("col0", array_join(col("temp"), replacement))
// we no longer need this `temp` column
.drop(col("temp"))
.show(false)
This gives:
+------------------+
|col0              |
+------------------+
|a**BC**defgbchijkl|
+------------------+
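To see why splitting with a limit of 2 and rejoining behaves like replaceFirst(), the same logic can be sketched in plain Scala, without Spark. The helper name `replaceFirstLiteral` is hypothetical and only for illustration; it also quotes the search string with `Pattern.quote`, an addition that is not in the answer above, so that regex metacharacters are matched literally.

```scala
import java.util.regex.Pattern

object ReplaceFirstSketch {
  // Hypothetical helper (not part of Spark or the original answer):
  // split into at most two pieces around the first occurrence of `target`,
  // then join the pieces back together with `replacement`.
  def replaceFirstLiteral(s: String, target: String, replacement: String): String =
    s.split(Pattern.quote(target), 2).mkString(replacement)

  def main(args: Array[String]): Unit = {
    // Only the first "bc" is replaced; the second one is left alone.
    println(replaceFirstLiteral("abcdefgbchijkl", "bc", "**BC**"))
    // a**BC**defgbchijkl

    // With no occurrence, the split yields a single piece, so the
    // join returns the input unchanged.
    println(replaceFirstLiteral("xyz", "bc", "**BC**"))
    // xyz
  }
}
```

Because the limit of 2 caps the result at two pieces, there is at most one place to put the replacement, which is what restricts the change to the first occurrence.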
For (Spark) SQL:
-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
-- create a temporary column, splitting the string by the first occurrence of `bc`
SELECT split(col0, "bc", 2) AS tempr
FROM (
SELECT 'abcdefgbchijkl' AS col0
)
)
Spark < 3.0.0 (pre-2020, tested with Spark 2.4.5)
import org.apache.spark.sql.functions.{array_join, col, udf}
val test0 = Seq("abcdefgbchijkl").toDF("col0")
val stringToReplace = "bc"
val replacement = "**BC**"
val splitFirst = udf { (s: String) => s.split(stringToReplace, 2) }
spark.udf.register("splitFirst", splitFirst) // if you're using Spark SQL
test0
// create a temporary column, splitting the string by the first occurrence of `bc`
.withColumn("temp", splitFirst(col("col0")))
// recombine the strings before and after `bc` with the desired replacement
.withColumn("col0", array_join(col("temp"), replacement))
// we no longer need this `temp` column
.drop(col("temp"))
.show(false)
This gives:
+------------------+
|col0              |
+------------------+
|a**BC**defgbchijkl|
+------------------+
For (Spark) SQL:
-- recombine the strings before and after `bc` with the desired replacement
SELECT tempr[0] || "**BC**" || tempr[1] AS col0
FROM (
-- create a temporary column, splitting the string by the first occurrence of `bc`
SELECT splitFirst(col0) AS tempr -- `splitFirst` was registered above
FROM (
SELECT 'abcdefgbchijkl' AS col0
)
)
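A different technique, not part of the answer above, may avoid the UDF on older versions: a regex replace-all restricted to a single match by anchoring the pattern at the start of the string. The sketch below shows the idea in plain Scala with `String.replaceAll`; the helper name `replaceFirstAnchored` is hypothetical. Since Spark's `regexp_replace` also uses Java regular expressions and replaces every match, the same pattern and `$1` replacement should carry over, but that is an assumption and has not been tested against Spark here.

```scala
import java.util.regex.{Matcher, Pattern}

object AnchoredReplaceSketch {
  // Hypothetical helper: replace-all that can match at most once.
  def replaceFirstAnchored(s: String, target: String, replacement: String): String = {
    // (?s) lets `.` match newlines; ^ pins the match to the start of the input;
    // (.*?) lazily captures everything before the first occurrence of `target`.
    val pattern = "(?s)^(.*?)" + Pattern.quote(target)
    // $1 restores the captured prefix; quoteReplacement escapes `$` and `\`
    // so the replacement text is inserted literally.
    s.replaceAll(pattern, "$1" + Matcher.quoteReplacement(replacement))
  }

  def main(args: Array[String]): Unit = {
    println(replaceFirstAnchored("abcdefgbchijkl", "bc", "**BC**"))
    // a**BC**defgbchijkl
  }
}
```

The anchor is what makes this work: after the first match, no later position can satisfy `^`, so the second `bc` is never touched.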