String comparison in Databricks Spark SQL
select distinct
promo_name
,case
when substring(promo_name,instr(promo_name, "P0"),2) = "P0" then 0
when substring(promo_name,instr(promo_name, "P1"),2) = "P1" then 1
When substring(promo_name,instr(promo_name, "P01"),3) = "P01" then 1
when substring(promo_name,instr(promo_name, "P2"),2) = "P2" then 2
When substring(promo_name,instr(promo_name, "P02"),3) = "P02" then 2
when substring(promo_name,instr(promo_name, "P3"),2) = "P3" then 3
when substring(promo_name,instr(promo_name, "P03"),3) = "P03" then 3
when substring(promo_name,instr(promo_name, "P4"),2) = "P4" then 4
when substring(promo_name,instr(promo_name, "P04"),3) = "P04" then 4
when substring(promo_name,instr(promo_name, "P5"),2) = "P5" then 5
when substring(promo_name,instr(promo_name, "P05"),3) = "P05" then 5
when substring(promo_name,instr(promo_name, "P6"),2) = "P6" then 6
when substring(promo_name,instr(promo_name, "P06"),3) = "P06" then 6
when substring(promo_name,instr(promo_name, "P7"),2) = "P7" then 7
when substring(promo_name,instr(promo_name, "P07"),3) = "P07" then 7
when trim(substring(promo_name,instr(promo_name, "P8"),2)) ="P8" then 8
when trim(substring(promo_name,instr(promo_name, "P08"),3)) ="P08" then 8
when trim(substring(promo_name,instr(promo_name, "P9"),2)) ="P9" then 9
when trim(substring(promo_name,instr(promo_name, "P09"),3)) ="P09" then 9
when trim(substring(promo_name,instr(promo_name, "P10"),3)) ="P10" then 10
when trim(substring(promo_name,instr(promo_name, "P11"),3)) ="P11" then 11
when trim(substring(promo_name,instr(promo_name, "P12"),3)) ="P12" then 12
else 0 end as promo_id
,case
when trim(substring(promo_name,instr(promo_name,"P10"),3)) = "P10" then 10
when trim(substring(promo_name,instr(promo_name,"P11"),3)) = "P11" then 11
when trim(substring(promo_name,instr(promo_name,"P12"),3)) = "P12" then 12
when trim(substring(promo_name,instr(promo_name,"P13"),3)) = "P13" then 13
when trim(substring(promo_name,instr(promo_name,"P14"),3)) = "P14" then 14
else 0 end as id
from hbi_dns_protected.store_zones_stock_v7_1_4
where promo_name is not null
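I am trying to extract an ID from a string. When I handle P10 to P14 in a separate column it works fine, but when I handle them in the same column as the single-digit codes it returns 1 instead of 11, 1 instead of 12, and so on.
Am I doing something wrong?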
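A case expression stops at the first branch that matches, so a name containing "P11" already satisfies the "P1" branch and returns 1.
I suggest reordering the branches and using like: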
(case when promo_name like 'P14%' then 14
when promo_name like 'P13%' then 13
. . .
end)
Perhaps you should ask a new question that includes sample data and the desired results. There may be a simpler approach.
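The reordering idea can be sketched as a complete query. This is only an illustration under assumptions: the inline VALUES rows are made-up sample data, and the patterns use a leading % because the question's instr calls look for the code anywhere in the name rather than only at the start.

```sql
-- Sketch only: the rows in VALUES are hypothetical, not taken from the question.
SELECT
  promo_name,
  CASE
    -- Two-digit codes come first, so the 'P1' branch cannot claim 'P11' or 'P12'.
    WHEN promo_name LIKE '%P14%' THEN 14
    WHEN promo_name LIKE '%P13%' THEN 13
    WHEN promo_name LIKE '%P12%' THEN 12
    WHEN promo_name LIKE '%P11%' THEN 11
    WHEN promo_name LIKE '%P10%' THEN 10
    -- Single-digit and zero-padded codes are checked only after the longer ones.
    WHEN promo_name LIKE '%P01%' THEN 1
    WHEN promo_name LIKE '%P1%'  THEN 1
    ELSE 0
  END AS promo_id
FROM VALUES ('Summer P1'), ('Summer P11'), ('Summer P01'), ('No promo') AS t(promo_name);
```

With these rows the query should give 1, 11, 1 and 0. The same ordering rule applies to the remaining codes: any branch whose pattern is a prefix of another pattern has to come after it.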
Why not use regexp_extract to pull the number out of the string with a regular expression, rather than writing a branch for every case? For example:
%sql
SELECT *,
regexp_extract( promo_name, ' P(\\d+)', 1 ) AS promoNumber
FROM tmp
Note that the regular expression is case-sensitive. If you need to capture both lowercase and uppercase Ps, you can use a character class, i.e. [pP].
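Full explanation of the RegEx pattern used:
- The regular expression starts with a space character and an uppercase P. These match a literal space followed by a literal uppercase P. If you want the match to ignore case, you can use a character class such as [pP], which matches any one of the characters inside the brackets (each of them case-sensitively).
- The next component of the RegEx is (\\d+). It consists of the RegEx pattern \d, which matches a digit, and the + sign, which means 'match one or more'. The parentheses make it a group, namely group 1. \\d has an extra backslash, which is the escape character required by the Spark SQL implementation of regexp_extract.
- The last argument to regexp_extract has the value 1, which means 'return group 1 from the function'.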
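Putting the pieces above together, here is a minimal sketch that returns an integer ID in one expression. It rests on a few assumptions: the VALUES rows are invented sample data, the code is always preceded by a space, and string literals are parsed with Spark's default escaping (hence the doubled backslash). regexp_extract returns an empty string when nothing matches, so nullif and coalesce are used here to fall back to 0 as in the original query.

```sql
-- Sketch only: the rows in VALUES are hypothetical, not taken from the question.
SELECT
  promo_name,
  coalesce(
    cast(
      -- '' (no match) becomes NULL so the cast never sees an empty string
      nullif(regexp_extract(promo_name, ' [pP](\\d+)', 1), '') AS INT
    ),
    0  -- default when the name carries no promo code
  ) AS promo_id
FROM VALUES ('Summer P1'), ('Summer P11'), ('Summer p07'), ('No promo') AS t(promo_name);
```

With these rows the result should be 1, 11, 7 and 0. Because \d+ takes every digit that follows the P, there is no branch ordering to get wrong, and the cast drops the leading zero of codes such as P07.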
I use regex101.com to test and practise RegEx expressions.