Function for string column value comparison in Redshift

I am using ilike in Redshift for case-insensitive string comparison between columns. To handle strings of different lengths, I run:

select *
    from testset c
    join predictions p
    on p.id = c.id and
    c.text ilike '%'+p.text+'%'

union

select *
    from testset c
    join predictions p
    on p.id = c.id and
    p.text ilike '%'+c.text+'%'

to cover both length(c) > length(p) and the reverse. That said, like has several limitations. For example, when

p.text = "TOKEN1 TOKEN2 TOKEN3"
c.text = "TOKEN1 TOKEN3"

this will not work. I was thinking of using a Redshift function (or a Python function), but I'm not sure how I can support things like Levenshtein distance, string similarity (with a threshold), etc. in that function (or whether it is possible), using the libraries available to Python UDFs.
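Levenshtein distance does not strictly need an external library. Below is a minimal pure-Python sketch; the names `levenshtein` and `similarity` and the 0.6 threshold are illustrative assumptions, not part of Redshift or of any library. It avoids Python 3-only syntax because Redshift Python UDFs run on Python 2.7, so the function bodies could in principle be reused inside a `plpythonu` UDF.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the distance to a 0.0-1.0 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - float(levenshtein(a, b)) / longest


THRESHOLD = 0.6  # hypothetical cut-off; tune it on your own data
# Lowercasing both sides mimics the case-insensitivity of ilike.
score = similarity("TOKEN1 TOKEN3".lower(), "TOKEN1 TOKEN2 TOKEN3".lower())
print(score >= THRESHOLD)
```

Dividing by the longer length keeps the score comparable across strings of different sizes, which is the case the two-way ilike join was trying to cover.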

What I have so far:

create or replace function f_compare(a VARCHAR, b VARCHAR) returns float IMMUTABLE as $$
    from difflib import SequenceMatcher

    def diff(strL, strR):
        # ratio() is a similarity score from 0.0 (nothing in common) to 1.0 (identical)
        return SequenceMatcher(None, strL, strR).ratio()

    return diff(a, b)
$$ LANGUAGE plpythonu;

with samples as (
    select 
    cast('TOKEN1 TOKEN3' as VARCHAR) as name,
    cast('TOKEN1 TOKEN2 TOKEN3' as VARCHAR) as name1
)

select f_compare(name, name1) from samples
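Because the UDF body is plain Python, it can be checked locally before the function is created in Redshift. The sketch below repeats the `SequenceMatcher` computation using only the standard library. Lowercasing both inputs (to mimic ilike) and the `is_match` helper with its 0.7 threshold are assumptions added for illustration.

```python
from difflib import SequenceMatcher


def compare(str_l, str_r):
    """Similarity in [0.0, 1.0], computed the same way as in the UDF body."""
    # Lowercasing is an added assumption, to mimic ilike's case-insensitivity.
    return SequenceMatcher(None, str_l.lower(), str_r.lower()).ratio()


def is_match(str_l, str_r, threshold=0.7):
    """Hypothetical helper: scores at or above the threshold count as a match."""
    return compare(str_l, str_r) >= threshold


ratio = compare("TOKEN1 TOKEN3", "TOKEN1 TOKEN2 TOKEN3")
print(round(ratio, 3))
print(is_match("TOKEN1 TOKEN3", "TOKEN1 TOKEN2 TOKEN3"))
```

In the join, the same idea would presumably replace the two ilike conditions with a single predicate such as `f_compare(c.text, p.text) >= 0.7`, with the threshold chosen from your own data.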

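You can use an existing Python library instead of writing your own. TheFuzz (formerly published as fuzzywuzzy, the name used in the code below) is a very popular one.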

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

UDF

CREATE FUNCTION fuzzy_test (string_a TEXT, string_b TEXT) RETURNS FLOAT IMMUTABLE
AS
$$
  # fuzz.ratio returns an integer similarity score from 0 to 100
  from fuzzywuzzy import fuzz
  return fuzz.ratio(string_a, string_b)
$$ LANGUAGE plpythonu;

Query

SELECT fuzzy_test('brooklyn bridge', 'brooklin bridge');
-- Output
-- 93

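However, you first have to import the library into your Redshift cluster. Download the fuzzywuzzy repository from GitHub, zip it, upload it to your S3 bucket, and create a library from it.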

CREATE LIBRARY fuzzywuzzy LANGUAGE plpythonu FROM 's3://<bucket_name>/fuzzywuzzy.zip' CREDENTIALS 'aws_access_key_id=<access key id>;aws_secret_access_key=<secret key>'
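The zip step can be scripted. The sketch below only builds the archive, using the standard library. It assumes the archive should contain the package folder (`fuzzywuzzy/`, with its `__init__.py`) at the top level, and the function name `zip_package` and the paths in the usage comment are placeholders. Uploading to S3 is left out because it needs a tool outside the standard library, such as the AWS CLI.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip package_dir so that its folder name is the top-level entry."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".pyc"):
                    continue  # skip compiled bytecode
                full_path = os.path.join(root, name)
                # Store paths relative to the parent, e.g. fuzzywuzzy/fuzz.py
                archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


# Example call (placeholder paths): point the first argument at the inner
# package folder of the downloaded repository.
# zip_package("fuzzywuzzy-master/fuzzywuzzy", "fuzzywuzzy.zip")
```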