Function for string column value comparison in Redshift
I am using ilike in Redshift for case-insensitive string comparison between columns. To cope with the two strings having different lengths, I run:
select *
from testset c
join predictions p
on p.id = c.id and
c.text ilike '%'+p.text+'%'
union
select *
from testset c
join predictions p
on p.id = c.id and
p.text ilike '%'+c.text+'%'
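This is meant to cover both length(c) > length(p) and the reverse. Incidentally, like has several limitations. For example, when
p.text = "TOKEN1 TOKEN2 TOKEN3"
c.text = "TOKEN1 TOKEN3"
neither condition matches, because neither string is a substring of the other.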
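The limitation above can be reproduced outside the database. The following is only a sketch in plain Python: `contains_ci` is a hypothetical stand-in for the `ilike '%...%'` test (it ignores SQL wildcards inside the values), and `tokens_subset` is one possible token-based alternative, not something Redshift provides.

```python
def contains_ci(haystack, needle):
    """Rough analogue of: haystack ILIKE '%' + needle + '%' (no wildcard handling)."""
    return needle.lower() in haystack.lower()


def tokens_subset(short, long_):
    """True when every whitespace-separated token of `short` occurs in `long_`."""
    return set(short.lower().split()) <= set(long_.lower().split())


p_text = "TOKEN1 TOKEN2 TOKEN3"
c_text = "TOKEN1 TOKEN3"

# Substring test in both directions, as in the UNION query.
substring_hit = contains_ci(p_text, c_text) or contains_ci(c_text, p_text)
# Token test: the missing TOKEN2 in the middle no longer matters.
token_hit = tokens_subset(c_text, p_text)
print(substring_hit, token_hit)  # False True
```

The token test ignores word order and repeated tokens, which may or may not be acceptable for the data at hand.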
I was thinking of using a Redshift function (or a Python function), but I'm not sure how I can support things like Levenshtein distance or string similarity (with a threshold) in that function, or whether that is even possible with the libraries available to Python UDFs.
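For reference, Levenshtein distance with a threshold needs no external library. Below is a minimal pure-Python sketch that sticks to constructs available in Python 2.7, the interpreter used by Redshift Python UDFs. The function names and the default threshold of 0.6 are illustrative assumptions, not part of any Redshift API.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - float(levenshtein(a, b)) / longest


def is_match(a, b, threshold=0.6):
    """Case-insensitive match when similarity reaches the (hypothetical) threshold."""
    return similarity(a.lower(), b.lower()) >= threshold
```

For the two sample strings the distance is 7 (the inserted "TOKEN2 "), giving a similarity of 0.65, so they match at a threshold of 0.6 but not at 0.7. The function bodies could be placed inside the `$$ ... $$` section of a UDF, at the cost of running interpreted Python for every row pair.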
What I have so far:
create or replace function f_compare(a VARCHAR, b VARCHAR) returns float IMMUTABLE as $$
    from difflib import SequenceMatcher

    def diff(strL, strR):
        # ratio() is a similarity score in [0, 1]; 1.0 means identical strings
        return SequenceMatcher(None, strL, strR).ratio()

    return diff(a, b)
$$ LANGUAGE plpythonu;
with samples as (
select
cast('TOKEN1 TOKEN3' as VARCHAR) as name,
cast('TOKEN1 TOKEN2 TOKEN3' as VARCHAR) as name1
)
select f_compare(name, name1) from samples
You can use an existing Python library instead of writing your own. TheFuzz (formerly published as fuzzywuzzy) is a very popular one.
It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
UDF
CREATE FUNCTION fuzzy_test (string_a TEXT, string_b TEXT) RETURNS FLOAT IMMUTABLE
AS
$$
    from fuzzywuzzy import fuzz
    return fuzz.ratio(string_a, string_b)
$$ LANGUAGE plpythonu;
Query
SELECT fuzzy_test('brooklyn bridge', 'brooklin bridge');
-- Output
-- 93
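To see where a score like 93 can come from without installing anything, here is a standard-library sketch. It assumes the score is a rounded 0 to 100 version of difflib's `ratio()`; that is an assumption about the library's behaviour rather than a statement of its implementation, and `ratio_0_100` is a made-up name.

```python
from difflib import SequenceMatcher


def ratio_0_100(a, b):
    """Similarity on a 0-100 integer scale, based on difflib's ratio()."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


score = ratio_0_100('brooklyn bridge', 'brooklin bridge')
print(score)  # 93
```

Here 14 of the 15 characters line up in each string, so the ratio is 2 * 14 / 30, which rounds to 93.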
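However, you have to import the library into your Redshift cluster first. Download the fuzzywuzzy repository from GitHub, zip it, upload the archive to your S3 bucket, and create a library from it.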
CREATE LIBRARY fuzzywuzzy LANGUAGE plpythonu FROM 's3://<bucket_name>/fuzzywuzzy.zip' CREDENTIALS 'aws_access_key_id=<access key id>;aws_secret_access_key=<secret key>'
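The zipping step can be scripted with the standard library. This sketch assumes Redshift expects the package directory itself (the one containing `__init__.py`) at the top level of the archive, which is worth confirming against the CREATE LIBRARY documentation. `zip_package` and the paths in the usage note are hypothetical.

```python
import os
import shutil


def zip_package(parent_dir, package_name, out_base):
    """Zip parent_dir/package_name so the package folder sits at the archive root."""
    init_file = os.path.join(parent_dir, package_name, '__init__.py')
    if not os.path.isfile(init_file):
        raise ValueError('not a Python package: %s' % package_name)
    # base_dir keeps "package_name/..." as the path prefix inside the archive.
    return shutil.make_archive(out_base, 'zip',
                               root_dir=parent_dir, base_dir=package_name)
```

For example, `zip_package('fuzzywuzzy-master', 'fuzzywuzzy', 'fuzzywuzzy')` would write `fuzzywuzzy.zip` in the current directory, assuming the repository was unpacked into a folder named `fuzzywuzzy-master`. The resulting file is what gets uploaded to the S3 location named in the CREATE LIBRARY statement.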