How to explode in Spark with a delimiter

I have a table:

id  itemNames                                          coupons
1   item (foo bar) is available, soaps                 true
2   item (bar) is available                            false
3   soaps, shampoo                                     false
4   item (foo bar, bar) is available                   true
5   item (foo bar, bar) is available, (soap, shampoo)  true
6   null                                               false

I want to explode this into:

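id  itemNames                          coupons
1   item (foo bar) is available        true
1   soaps                              true
2   item (bar) is available            false
3   soaps                              false
3   shampoo                            false
4   item (foo bar, bar) is available   true
5   item (foo bar, bar) is available   true
5   (soap, shampoo)                    true
6   null                               false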

When I do this:

 df.withColumn("itemNames", explode(split($"itemNames", "[,]")))

I get:

itemNames                                          coupons
item (foo bar) is available                        true       
soaps                                              true 
item (bar) is available                            false
soaps                                              false
shampoo                                            false
item (foo bar,                                     true
bar) is available                                  true 
(soap,                                             true    
shampoo)                                           true

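Can someone tell me what I am doing wrong and how I can correct it? One consistent pattern is that the commas I do not want to split on appear inside parentheses ().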

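Your data has no general pattern that a lookbehind could split on. Below is a workaround that works for this specific case: I use a lookbehind so that the split happens only on a comma that directly follows "available". Try this in your DataFrame explode.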

scala> "item (foo bar) is available, soaps".split("(?<=available),")
res41: Array[String] = Array(item (foo bar) is available, " soaps")

scala> "item (foo bar) is available, soaps".split("(?<=available),").length
res42: Int = 2

scala> "item (foo bar, bar) is available".split("(?<=available),")
res44: Array[String] = Array(item (foo bar, bar) is available)

scala> "item (foo bar, bar) is available".split("(?<=available),").length
res45: Int = 1

EDIT1

scala> "item (foo bar, bar) is empty, (soap, shampoo)".split("(?<=available|empty),").length
res1: Int = 2
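As a minimal sketch of how the lookbehind split might be applied in the DataFrame itself, the snippet below passes the same pattern to Spark's built-in `split` and then calls `explode` and `trim`. The local `SparkSession` setup, the sample rows, and the name `exploded` are assumptions for illustration. Note that row 3 ("soaps, shampoo") stays unsplit, because its comma does not follow "available"; this is the limitation of the workaround.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split, trim}

// Assumed local session; in spark-shell, getOrCreate() returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("lookbehind-split").getOrCreate()
import spark.implicits._

// A few rows from the question, with the id column included.
val df = Seq(
  (1, "item (foo bar) is available, soaps", true),
  (3, "soaps, shampoo", false),
  (4, "item (foo bar, bar) is available", true)
).toDF("id", "itemNames", "coupons")

// Split only on a comma directly preceded by "available", then drop leading spaces.
val exploded = df
  .withColumn("itemNames", explode(split($"itemNames", "(?<=available),")))
  .withColumn("itemNames", trim($"itemNames"))

exploded.show(false)
```

Row 1 yields two rows, while rows 3 and 4 each yield one.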

Using a UDF, inspired by "Regex to match only commas not in parentheses?":

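import java.util.regex.Pattern
import org.apache.spark.sql.functions.{explode, udf}
// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._

val df = List(
  ("item (foo bar) is available, soaps", true),
  ("item (bar) is available", false),
  ("soaps, shampoo", false),
  ("item (foo bar, bar) is available", true),
  ("item (foo bar, bar) is available, (soap, shampoo)", true)
).toDF("itemNames", "coupons")
df.show(false)

// Pattern.COMMENTS allows whitespace and # comments inside the regex.
// The closing paren must be escaped as \\) in a Scala string literal.
val regex = Pattern.compile(
  ",         # Match a comma\n" +
    "(?!       # only if it's not followed by...\n" +
    " [^(]*    #   any number of characters except opening parens\n" +
    " \\)      #   followed by a closing parens\n" +
    ")         # End of lookahead",
  Pattern.COMMENTS)

// Split on commas outside parentheses, then explode one row per piece.
val customSplit = (value: String) => regex.split(value)
val customSplitUDF = udf(customSplit)
val result = df.withColumn("itemNames", explode(customSplitUDF($"itemNames")))
result.show(false)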

The output is:

+--------------------------------+-------+
|itemNames                       |coupons|
+--------------------------------+-------+
|item (foo bar) is available     |true   |
| soaps                          |true   |
|item (bar) is available         |false  |
|soaps                           |false  |
| shampoo                        |false  |
|item (foo bar, bar) is available|true   |
|item (foo bar, bar) is available|true   |
| (soap, shampoo)                |true   |
+--------------------------------+-------+
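Since Spark's built-in `split` takes a Java regular expression, the same lookahead can probably be used without a UDF. The sketch below is one possible variant, not the answer's original method: it assumes Spark 2.2 or later for `explode_outer`, which keeps the row whose `itemNames` is null, and the session setup and sample rows are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode_outer, split, trim}

// Assumed local session; in spark-shell, getOrCreate() returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("split-outside-parens").getOrCreate()
import spark.implicits._

// Rows 5 and 6 from the question; None becomes a null itemNames value.
val df = Seq[(Int, Option[String], Boolean)](
  (5, Some("item (foo bar, bar) is available, (soap, shampoo)"), true),
  (6, None, false)
).toDF("id", "itemNames", "coupons")

// Split on commas that are not followed by ")" before any "(",
// i.e. commas outside parentheses; explode_outer preserves the null row.
val result = df
  .withColumn("itemNames", explode_outer(split($"itemNames", ",(?![^(]*\\))")))
  .withColumn("itemNames", trim($"itemNames"))

result.show(false)
```

Plain `explode` would drop the row with id 6, because it produces no output rows for a null array.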

If you need to trim the results, that can easily be added to customSplit.
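A minimal sketch of that trimming step, written as plain Scala so it can be checked without a cluster. The names `outsideParens` and `splitAndTrim` are hypothetical, and returning an empty list for null input is an assumption about the desired behavior rather than something the answer specifies.

```scala
import java.util.regex.Pattern

// Compact form of the pattern: a comma not followed by ")" before any "(".
val outsideParens: Pattern = Pattern.compile(",(?![^(]*\\))")

// Hypothetical helper: split outside parentheses, trim each piece,
// and map a null input to an empty list instead of throwing.
val splitAndTrim: String => List[String] = value =>
  Option(value)
    .map(v => outsideParens.split(v).map(_.trim).toList)
    .getOrElse(Nil)

println(splitAndTrim("item (foo bar, bar) is available, (soap, shampoo)"))
println(splitAndTrim("soaps, shampoo"))
```

To use it in the DataFrame, wrap it with udf(splitAndTrim) in place of customSplit. Keep in mind that explode drops rows whose array is empty or null.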