Spark SQL array_contains with a regex doesn't work
I have a DataFrame as follows:
val data = Seq(
"""{"Data": [{ "name": "FName", "value": "Alex" }, { "name": "LName", "value": "Johnson" }]}""",
"""{"Data": [{ "name": "FName", "value": "Alexis" }, { "name": "LName", "value": "Paul" }]}""",
"""{"Data": [{ "name": "FName", "value": "Alexander" }, { "name": "LName", "value": "Strong" }]}""",
"""{"Data": [{ "name": "FName", "value": "Baron" }, { "name": "LName", "value": "Corbin" }]}""",
)
val df = spark.read.json(spark.sparkContext.parallelize(data))
df.createOrReplaceTempView("df")
The schema is as follows:
root
 |-- Data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)
The data in df looks like this:
Data
[{"name":"FName","value":"Alex"},{"name":"LName","value":"Johnson"}]
[{"name":"FName","value":"Alexis"},{"name":"LName","value":"Paul"}]
[{"name":"FName","value":"Alexander"},{"name":"LName","value":"Strong"}]
[{"name":"FName","value":"Baron"},{"name":"LName","value":"Corbin"}]
I need all records where FName starts with 'Alex'.
Expected output:
Data
[{"name":"FName","value":"Alex"},{"name":"LName","value":"Johnson"}]
[{"name":"FName","value":"Alexis"},{"name":"LName","value":"Paul"}]
[{"name":"FName","value":"Alexander"},{"name":"LName","value":"Strong"}]
Spark SQL query 1:
select * from df where array_contains (Data.value, "Al%")
Spark SQL query 2:
select * from df where array_contains (Data.value, "Al*")
Both of these queries return an empty result.
Spark SQL query 3:
select * from df where array_contains (Data.value, "Alex")
Result:
Data
[{"name":"FName","value":"Alex"},{"name":"LName","value":"Johnson"}]
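How can I use LIKE or a regex with array_contains?
array_contains tests for exact equality, so % and * are treated as literal characters rather than wildcards. Use the exists function instead: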
select * from df where exists(Data.value, x -> x like 'Al%')
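Note that the query above matches any value in the array, so a last name beginning with "Al" would also qualify. If only FName should be tested, the predicate can look at both struct fields. Below is a minimal sketch of that idea, not a definitive implementation. It assumes Spark 3.0 or later for functions.exists in the DataFrame API (the SQL form of exists is a higher-order function available from Spark 2.4), a local SparkSession, and the pattern ^Alex taken from the stated requirement. The names bySql and byApi are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, exists}

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("exists-example").getOrCreate()
import spark.implicits._

val data = Seq(
  """{"Data": [{ "name": "FName", "value": "Alex" }, { "name": "LName", "value": "Johnson" }]}""",
  """{"Data": [{ "name": "FName", "value": "Alexis" }, { "name": "LName", "value": "Paul" }]}""",
  """{"Data": [{ "name": "FName", "value": "Alexander" }, { "name": "LName", "value": "Strong" }]}""",
  """{"Data": [{ "name": "FName", "value": "Baron" }, { "name": "LName", "value": "Corbin" }]}"""
)
val df = spark.read.json(data.toDS())
df.createOrReplaceTempView("df")

// SQL form: iterate over the structs themselves so both fields are visible.
val bySql = spark.sql(
  """select * from df
    |where exists(Data, x -> x.name = 'FName' and x.value rlike '^Alex')""".stripMargin)

// DataFrame API form (assumes Spark 3.0+): the same predicate as a Column lambda.
val byApi = df.filter(
  exists(col("Data"), e => e.getField("name") === "FName" && e.getField("value").rlike("^Alex"))
)

bySql.show(false)
byApi.show(false)
```

Here rlike applies a Java regular expression, while like 'Alex%' would express the same prefix test with SQL wildcards. Either can be used inside the lambda.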