Snowflake Regular Expression

Question

I have this string in a Snowflake column:

[
{
"entryListId": 3279,
"id": 4617,
"name": "SpecTra",
"type": 0
},
{
"entryListId": 3279,
"id": 7455,
"name": "Signal Capital Partners",
"type": 0
}
]

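However many company names there are, I need to get the names in this format: "SpecTra, Signal Capital Partners". In other words, I need to extract the company names and concatenate them.

I have tried this:

regexp_replace(col, '"([^"]+)"|.', '\1|')

and the regexp_substr() function, but did not get the desired output.

Can you help me solve this? Thanks.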

Answer 1

So, pushing your text blob into a CTE:

with data as (
    SELECT * FROM VALUES
    ('[{"entryListId": 3279,"id": 4617,"name": "SpecTra","type": 0},{"entryListId": 3279,"id": 7455,"name": "Signal Capital Partners","type": 0}]')
    t(str)
)

I can't help but notice that it is JSON, so let's PARSE_JSON it and then FLATTEN it, and there are your "names":

select 
    d.*
    ,f.value:name::text as name
from data d
    ,table(flatten(input=>parse_json(d.str))) f

Gives:

STR	NAME
[{"entryListId": 3279,"id": 4617,"name": "SpecTra","type": 0},{"entryListId": 3279,"id": 7455,"name": "Signal Capital Partners","type": 0}]	SpecTra
[{"entryListId": 3279,"id": 4617,"name": "SpecTra","type": 0},{"entryListId": 3279,"id": 7455,"name": "Signal Capital Partners","type": 0}]	Signal Capital Partners

So, to aggregate, use LISTAGG:

select 
    listagg(f.value:name::text, ',') as names
from data d
    ,table(flatten(input=>parse_json(d.str))) f

Gives:

NAMES
SpecTra,Signal Capital Partners
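
The query above aggregates over every row in the table at once. If the column holds one JSON array per row and each row should get its own list, a GROUP BY is needed. The following is a minimal sketch under assumptions that are not in the question: the id column is hypothetical and only serves to tell rows apart, and the second row is made-up sample data. Ordering by the INDEX column that FLATTEN returns is one way to keep the names in their array order.

```sql
-- Sketch: one comma-separated list of names per source row.
-- The id column and the second row are hypothetical, for illustration only.
with data as (
    select * from values
    (1, '[{"id": 4617,"name": "SpecTra"},{"id": 7455,"name": "Signal Capital Partners"}]'),
    (2, '[{"id": 1,"name": "Acme"}]')
    t(id, str)
)
select
    d.id
    -- f.index is the position of the element inside the flattened array
    ,listagg(f.value:name::text, ',') within group (order by f.index) as names
from data d
    ,table(flatten(input=>parse_json(d.str))) f
group by d.id
order by d.id;
```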

Duplicate data:

You can add DISTINCT to the LISTAGG to keep only distinct values. That comes at a cost, so I am only pointing it out as an option; you did not mention duplicate data.

with data as (
    SELECT * FROM VALUES
    ('[
    {
"entryListId": 3279,
"id": 4617,
"name": "SpecTra",
"type": 0
},   
{
"entryListId": 3279,
"id": 4617,
"name": "SpecTra",
"type": 0
},
{
"entryListId": 3279,
"id": 7455,
"name": "Signal Capital Partners",
"type": 0
}]')
    t(str)
)
select 
    listagg(distinct f.value:name::text, ',') as names
from data d
    ,table(flatten(input=>parse_json(d.str))) f;

Gives:

NAMES
SpecTra,Signal Capital Partners

Whereas the regex solution cannot handle this case:

with data as (
    SELECT * FROM VALUES
    ('[
    {
"entryListId": 3279,
"id": 4617,
"name": "SpecTra",
"type": 0
},   
{
"entryListId": 3279,
"id": 4617,
"name": "SpecTra",
"type": 0
},
{
"entryListId": 3279,
"id": 7455,
"name": "Signal Capital Partners",
"type": 0
}]')
    t(str)
)
select 
    -- backslashes are doubled because the pattern and replacement are single-quoted literals
    trim(regexp_replace(regexp_replace(d.str, '"name":\\s*"([^"]+)"|.', '\\1,'), ',+', ','), ',') as regexp_replace
from data d

Gives:

REGEXP_REPLACE
, , , SpecTra, , , , , , SpecTra, , , , , , Signal Capital Partners, ,
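
One likely contributor to the stray commas above is that, by default, "." in a Snowflake regular expression does not match a newline, so the line breaks in the input survive between the commas. The sketch below passes the 's' parameter to REGEXP_REPLACE so that "." also matches newlines; the shortened JSON is an assumption made for brevity. Even then the duplicate name is not removed, which is the part that LISTAGG with DISTINCT handles.

```sql
-- Sketch: the same nested expression, with the 's' parameter on the inner call
-- so that "." also matches newline characters in multi-line input.
with data as (
    select '[
{"id": 4617, "name": "SpecTra"},
{"id": 4617, "name": "SpecTra"},
{"id": 7455, "name": "Signal Capital Partners"}
]' as str
)
select
    trim(regexp_replace(
        -- arguments: subject, pattern, replacement, position, occurrence (0 = all), parameters
        regexp_replace(d.str, '"name":\\s*"([^"]+)"|.', '\\1,', 1, 0, 's'),
        ',+', ','), ',') as names
from data d;
```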

Answer 2

You can use

trim(regexp_replace(regexp_replace(col, '"name":\\s*"([^"]+)"|.', '\\1,'), ',+', ','), ',')

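Details:

"name":\s*"([^"]+)"|. The first alternative matches "name":, then zero or more whitespace characters, then a ", then captures one or more characters other than " into group 1, then matches a closing ". The second alternative, ., matches any other single character. Each match is replaced with group 1 (empty for the . alternative) followed by a comma. In a single-quoted Snowflake string literal the backslashes must be doubled (\\s and \\1).
The second regexp_replace collapses each run of commas into a single comma: ,+ matches one or more commas (you could also use the more specific ,{2,} pattern here).
trim removes the leading and trailing commas.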
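
The three steps above can be looked at one at a time by giving each stage its own column. This is only a sketch: the single-line sample value and the stage1, stage2 and stage3 aliases are assumptions for illustration, and it relies on Snowflake allowing a later select-list item to refer to an earlier alias.

```sql
-- Sketch: each stage of the nested expression as a separate column.
-- Backslashes are doubled because the patterns sit in single-quoted literals.
with data as (
    select '[{"id": 4617,"name": "SpecTra"},{"id": 7455,"name": "Signal Capital Partners"}]' as col
)
select
    -- stage 1: keep group 1 for each name match, turn every other character into a comma
    regexp_replace(col, '"name":\\s*"([^"]+)"|.', '\\1,') as stage1
    -- stage 2: collapse each run of commas into a single comma
    ,regexp_replace(stage1, ',+', ',') as stage2
    -- stage 3: strip the leading and trailing commas
    ,trim(stage2, ',') as stage3
from data;
```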
