How to explode in Spark with a delimiter
I have a table:
id  itemNames                                          coupons
1   item (foo bar) is available, soaps                 true
2   item (bar) is available                            false
3   soaps, shampoo                                     false
4   item (foo bar, bar) is available                   true
5   item (foo bar, bar) is available, (soap, shampoo)  true
6   null                                               false
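I want to explode this into:
id  itemNames                         coupons
1   item (foo bar) is available       true
1   soaps                             true
2   item (bar) is available           false
3   soaps                             false
3   shampoo                           false
4   item (foo bar, bar) is available  true
5   item (foo bar, bar) is available  true
5   (soap, shampoo)                   true
6   null                              false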
When I do this:
df.withColumn("itemNames", explode(split($"itemNames", "[,]")))
I get:
itemNames coupons
item (foo bar) is available true
soaps true
item (bar) is available false
soaps false
shampoo false
item (foo bar, true
bar) is available true
(soap, true
shampoo) true
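Can someone tell me what I am doing wrong, and how I can correct it? One common pattern here is that the commas that should not be split on appear inside ().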
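Your problem does not offer a general pattern to split on with a lookbehind. The following is a workaround that works for this specific case: I split on the comma that directly follows "available", using a lookbehind. Try this in the explode on your DataFrame.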
scala> "item (foo bar) is available, soaps".split("(?<=available),")
res41: Array[String] = Array(item (foo bar) is available, " soaps")
scala> "item (foo bar) is available, soaps".split("(?<=available),").length
res42: Int = 2
scala> "item (foo bar, bar) is available".split("(?<=available),")
res44: Array[String] = Array(item (foo bar, bar) is available)
scala> "item (foo bar, bar) is available".split("(?<=available),").length
res45: Int = 1
EDIT1
scala> "item (foo bar, bar) is empty, (soap, shampoo)".split("(?<=available|empty),").length
res1: Int = 2
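The lookbehind workaround above only splits on commas that directly follow one of the listed words, so a row such as "soaps, shampoo" stays in one piece. A small sketch of that behaviour in plain Scala, without Spark; the names `LookbehindSplitDemo` and `splitAfterKeyword` are made up for illustration, and the `trim` is an addition that is not part of the original answer:

```scala
object LookbehindSplitDemo {
  // Pattern from the answer above: a comma that directly follows "available" or "empty".
  val afterKeyword: String = "(?<=available|empty),"

  // Hypothetical helper: split with the lookbehind pattern and trim each piece.
  def splitAfterKeyword(s: String): List[String] =
    s.split(afterKeyword).map(_.trim).toList

  def main(args: Array[String]): Unit = {
    // Splits once, after "available".
    println(splitAfterKeyword("item (foo bar) is available, soaps").mkString(" | "))
    // The commas inside the parentheses are left alone.
    println(splitAfterKeyword("item (foo bar, bar) is available, (soap, shampoo)").mkString(" | "))
    // No keyword before the comma, so nothing is split here.
    println(splitAfterKeyword("soaps, shampoo").mkString(" | "))
  }
}
```

If rows like "soaps, shampoo" must also be split, the keyword list alone is probably not enough, and a pattern that looks at the parentheses instead is likely a better fit.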
Using a UDF, inspired by "Regex to match only commas not in parentheses?":
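import java.util.regex.Pattern
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = List(
  ("item (foo bar) is available, soaps", true),
  ("item (bar) is available", false),
  ("soaps, shampoo", false),
  ("item (foo bar, bar) is available", true),
  ("item (foo bar, bar) is available, (soap, shampoo)", true)
).toDF("itemNames", "coupons")
df.show(false)

// Match a comma only when it is not inside parentheses.
// Pattern.COMMENTS lets the regex carry whitespace and # comments.
val regex = Pattern.compile(
  ",         # Match a comma\n" +
  "(?!       # only if it's not followed by...\n" +
  "  [^(]*   # any number of characters except opening parens\n" +
  "  \\)     # followed by a closing parens\n" +
  ")         # End of lookahead",
  Pattern.COMMENTS)

// Wrap the split in a UDF so it can be applied to the column, then explode the resulting array.
val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)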
The output is:
+--------------------------------+-------+
|itemNames                       |coupons|
+--------------------------------+-------+
|item (foo bar) is available     |true   |
| soaps                          |true   |
|item (bar) is available         |false  |
|soaps                           |false  |
| shampoo                        |false  |
|item (foo bar, bar) is available|true   |
|item (foo bar, bar) is available|true   |
| (soap, shampoo)                |true   |
+--------------------------------+-------+
If you need a "trim", it can easily be added to "customSplit".
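As a rough sketch of what adding the trim could look like, here is a plain Scala version of the split function that also tolerates a null input, since the question's table contains a null row. The names `ParenAwareSplit` and `splitItems` are hypothetical, and returning an empty list for null is an assumption about the desired behaviour, not something the answer specifies:

```scala
import java.util.regex.Pattern

object ParenAwareSplit {
  // Compact form of the regex above: a comma that is not followed by ")"
  // without an opening "(" in between, i.e. a comma outside parentheses.
  private val commaOutsideParens: Pattern = Pattern.compile(",(?![^(]*\\))")

  // Hypothetical helper: split, trim each piece, and drop empty pieces.
  // Assumption: a null input is mapped to an empty list.
  def splitItems(value: String): List[String] =
    Option(value).toList
      .flatMap(v => commaOutsideParens.split(v).toList)
      .map(_.trim)
      .filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    println(splitItems("item (foo bar, bar) is available, (soap, shampoo)").mkString(" | "))
    println(splitItems("soaps, shampoo").mkString(" | "))
    println(splitItems(null).isEmpty)
  }
}
```

In Spark, a function like this could presumably be wrapped with `udf` in the same way as `customSplit`. Note that `explode` drops rows whose array is empty or null, so keeping the null row would likely need `explode_outer` instead. Since the built-in `split` function takes a Java regular expression, passing the compact pattern to it directly, without a UDF, may also be worth trying.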