Is there a way to measure string similarity in Google BigQuery?

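I'd like to know whether anyone knows of a way to measure string similarity in BigQuery.

It seems like it would be a nice feature to have.

My use case is that I need to compare two URLs for similarity, because I want to be fairly sure they refer to the same article.

I can find examples using JavaScript, so maybe a UDF is the way to go, but I have not used UDFs at all (or JavaScript, for that matter :))

I'm wondering whether the existing regex functions could be used for this, or whether anyone could help me get started porting the JavaScript example into a UDF.

Any help is much appreciated, thanks.

Edit: adding some example code

So if I have a UDF defined as: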

// distance function

function levenshteinDistance (row, emit) {

  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};

    var myresult = Math.min(
        levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
        levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
        levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
    ) + 1;

  emit({outputA: myresult})

}

bigquery.defineFunction(
  'levenshteinDistance',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  levenshteinDistance                       // Reference to JavaScript UDF
);

// make a test function to test individual parts

function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}

bigquery.defineFunction(
  'test',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  test                       // Reference to JavaScript UDF
);

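I tried testing it with the following query:

SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))

I get the error:

Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function

It seems that row.inputA may not be a string, or that for some reason the string functions cannot operate on it. I'm not sure whether this is a type issue or something odd about which utilities UDFs can use by default.

Thanks again for any help.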

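Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by simply calculating (strlen - distance) / strlen, where strlen is the length of the longer string.

The simplest way to implement this is to define a Levenshtein UDF that takes two inputs, a and b, and computes the distance between them. The function can return a, b, and the distance.

To call it, you would pass in the two URLs as columns aliased as 'a' and 'b':
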
SELECT a, b, distance
FROM
  Levenshtein(
     SELECT
       some_url AS a, other_url AS b
     FROM
       your_table
  )
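
As a rough sketch of what such a UDF could look like: the recursive version in the question calls itself with plain strings where it expects a row object, so row.inputA is undefined on the second call, which would explain the TypeError. Keeping the distance calculation in a plain string helper avoids that. The names levenshtein, similarity and levenshteinUdf are my own, the output schema is an assumption, and the typeof guard is only there so the file can also be run outside BigQuery.

```javascript
// Iterative Levenshtein distance between two strings (two-row dynamic programming).
function levenshtein(a, b) {
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev.push(j);
  for (var i = 1; i <= a.length; i++) {
    var curr = [i];
    for (var k = 1; k <= b.length; k++) {
      var cost = a[i - 1] === b[k - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[k] + 1,        // deletion
        curr[k - 1] + 1,    // insertion
        prev[k - 1] + cost  // substitution (free when the characters match)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1], normalised by the length of the longer string.
function similarity(a, b) {
  var longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : (longest - levenshtein(a, b)) / longest;
}

// Legacy-SQL style UDF wrapper: reads columns a and b, emits a, b and distance.
// Treating NULL inputs as empty strings is an assumption of this sketch.
function levenshteinUdf(row, emit) {
  var a = row.a || '';
  var b = row.b || '';
  emit({ a: a, b: b, distance: levenshtein(a, b) });
}

// Only register the UDF when running inside BigQuery's legacy UDF environment.
if (typeof bigquery !== 'undefined') {
  bigquery.defineFunction(
    'Levenshtein',
    ['a', 'b'],
    [{ name: 'a', type: 'string' },
     { name: 'b', type: 'string' },
     { name: 'distance', type: 'integer' }],
    levenshteinUdf
  );
}
```

With this in place, the "abc" / "abd" pair from the question should come back with a distance of 1.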

I couldn't find a direct answer to this question, so I am proposing this solution in standard SQL:

#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
  (
  SELECT
    SUM(counter) AS diff
  FROM (
    SELECT
      CASE
        WHEN X.value != Y.value THEN 1
        ELSE 0
      END AS counter
    FROM (
      SELECT
        value,
        ROW_NUMBER() OVER() AS row
      FROM
        UNNEST(SPLIT(a, "")) AS value ) X
    JOIN (
      SELECT
        value,
        ROW_NUMBER() OVER() AS row
      FROM
        UNNEST(SPLIT(b, "")) AS value ) Y
    ON
      X.row = Y.row )
   )
);

WITH Input AS (
  SELECT 'abcdef' AS strings UNION ALL
  SELECT 'defdef' UNION ALL
  SELECT '1bcdef' UNION ALL
  SELECT '1bcde4' UNION ALL
  SELECT '123de4' UNION ALL
  SELECT 'abc123'
)

SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;

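Unlike the other solutions, it takes two strings (of the same length, following the definition of Hamming distance) and outputs the expected distance.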
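
For a quick cross-check of the numbers the query above should produce, here is a small JavaScript reference version. It is only a sketch: hammingDistance is my own name, and throwing on unequal lengths is a choice I made here, whereas the SQL join simply ignores positions past the end of the shorter string.

```javascript
// Reference Hamming distance: number of positions at which two
// equal-length strings differ.
function hammingDistance(a, b) {
  const x = Array.from(a); // split by code point, not UTF-16 code unit
  const y = Array.from(b);
  if (x.length !== y.length) {
    throw new RangeError('Hamming distance needs strings of equal length');
  }
  let diff = 0;
  for (let i = 0; i < x.length; i++) {
    if (x[i] !== y[i]) diff++;
  }
  return diff;
}

// Same sample inputs as the query above.
const target = 'abcdef';
const samples = ['abcdef', 'defdef', '1bcdef', '1bcde4', '123de4', 'abc123'];
for (const s of samples) {
  console.log(s, hammingDistance(target, s));
}
// distances, in order: 0, 3, 1, 2, 4, 3
```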

Below is a quite simple version of Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER():

#standardSQL
WITH Input AS (
  SELECT 'abcdef' AS strings UNION ALL
  SELECT 'defdef' UNION ALL
  SELECT '1bcdef' UNION ALL
  SELECT '1bcde4' UNION ALL
  SELECT '123de4' UNION ALL
  SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings, 
  (SELECT COUNT(1) 
    FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
    JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
    ON x = y AND a != b) hamming_distance
FROM Input

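If you are familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery by means of an external library loaded from GCS.

Steps

  1. Download the JavaScript version of fuzzywuzzy (fuzzball)
  2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (such as fuzzball)
  3. Upload it to a Google Cloud Storage bucket
  4. Create a temp function to use the library in your query (set the path in OPTIONS to the relevant path)
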
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
  library="gs://my-bucket/fuzzball.js");

with data as (select "my_test_string" as a, "my_other_string" as b)

SELECT  a, b, token_set_ratio(a, b) from data

Try Flookup for Google Sheets... it is definitely faster than Levenshtein distance, and it calculates percentage similarity out of the box. One Flookup function you might find useful is:

FUZZYMATCH (string1, string2)

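Parameter details

  1. string1: compared with string2.
  2. string2: compared with string1.

A percentage similarity is then calculated from these comparisons. Both parameters can be ranges.

I am currently trying to optimise it for large data sets, so your feedback is very welcome.

Edit: I am the creator of Flookup.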

Ready-to-use shared UDFs - Levenshtein distance:

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

6  2  0

Soundex:

SELECT fhoffa.x.soundex('felipe')
 , fhoffa.x.soundex('googgle')
 , fhoffa.x.soundex('guugle')

F410  G240  G240

Fuzzy choose one:

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  #, ['john', 'johnny', 'jonathan', 'jonas']
)

johnny

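How-to:

While looking at Felipe's answer above, I worked on my own query and ended up with two versions, one that I call string approximation and another that I call string resemblance.

The first looks at the shortest distance between matching letters of the source string and the test string, and returns a score between 0 and 1, where 1 is a complete match. It always scores against the longer of the two strings. It turns out to return results similar to the Levenshtein distance.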

#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
                              select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref, 
                              case 
                                when min(result) is null then 0
                                else 1 / (min(result) + 1) 
                              end as best_result,
                              from (
                                       select *,
                                              if(source = test, abs(sourceoffset - (testoffset)),
                                              greatest(length(testString),length(sourceString))) as result
                                       from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                                cross join
                                            (select *
                                             from unnest(split(lower(testString),'')) as test with offset as testoffset)
                                       ) as results
                              group  by ref
                                 )
        )
);
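
To make the scoring easier to follow, here is my reading of the function above as a plain JavaScript sketch. For every character position of the longer string it takes the nearest position of the same character in the other string, scores it 1 / (distance + 1), and falls back to a distance equal to the longer length when the character does not occur at all. The function name mirrors the SQL one, and returning 0 for empty input is an assumption of this sketch.

```javascript
// JavaScript rendering of the stringApproximation scoring.
function stringApproximation(sourceString, testString) {
  const source = Array.from(sourceString.toLowerCase());
  const test = Array.from(testString.toLowerCase());
  if (source.length === 0 || test.length === 0) return 0; // assumption

  const longest = Math.max(source.length, test.length);
  // The SQL groups by source offset when the test string is shorter,
  // and by test offset otherwise.
  const [ref, other] = test.length < source.length ? [source, test] : [test, source];

  let total = 0;
  ref.forEach((ch, i) => {
    let best = longest; // penalty distance when the character never matches
    other.forEach((o, j) => {
      if (o === ch) best = Math.min(best, Math.abs(i - j));
    });
    total += 1 / (best + 1);
  });
  return total / ref.length; // average score over the reference positions
}

console.log(stringApproximation('benjamin artis', 'benji'));
console.log(stringApproximation('benjamin artis', 'benjamin'));
```

Worked by hand for 'benji': the first four letters match in place (score 1 each), the two 'i's and the second 'n' match at distances 2, 8 and 5, and the remaining seven characters score 1/15 each, which averages to roughly 0.3627 over the 14 positions.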

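The second is a variation of the first that looks at sequences of matched distances, so that a matching character whose distance equals that of the character before or after it counts as one point. This works quite well, better than string approximation, but not quite as well as I would like (see the example output below).

#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (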
(
select avg(sequence)
from (
      select ref,
             if(array_length(array(select * from comparison.collection intersect distinct
                                   (select * from comparison.before))) > 0
                    or array_length(array(select * from comparison.collection intersect distinct
                                          (select * from comparison.after))) > 0
                 , 1, 0) as sequence

      from (
               select ref,
                      collection,
                      lag(collection) over (order by ref)  as before,
                      lead(collection) over (order by ref) as after
               from (
                     select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
                            array_agg(result ignore nulls)                                          as collection
                     from (
                              select *,
                                     if(source = test, abs(sourceoffset - (testoffset)), null) as result
                              from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                       cross join
                                   (select *
                                    from unnest(split(lower(testString),'')) as test with offset as testoffset)
                              ) as results
                     group by ref
                        )
               ) as comparison
      )

)
);

Below is a sample of the results:

#standardSQL
with test_subjects as (
  select 'benji' as name union all
  select 'benjamin' union all
  select 'benjamin alan artis' union all
  select 'ben artis' union all
  select 'artis benjamin' 
)

select name,
       `myproject.func.stringApproximation`('benjamin artis', name) as approximation,
       `myproject.func.stringResemblance`('benjamin artis', name) as resemblance
from test_subjects

order by resemblance desc

This returns:

+---------------------+--------------------+--------------------+
| name                | approximation      | resemblance        |
+---------------------+--------------------+--------------------+
| artis benjamin      | 0.2653061224489796 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| benjamin alan artis | 0.6078947368421053 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| ben artis           | 0.4142857142857142 | 0.7142857142857143 |
+---------------------+--------------------+--------------------+
| benjamin            | 0.6125850340136053 | 0.5714285714285714 |
+---------------------+--------------------+--------------------+
| benji               | 0.36269841269841263| 0.28571428571428575|
+---------------------+--------------------+--------------------+

Edited: updated the resemblance algorithm to improve results.

I did it like this:

CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
  (
    WITH a_trigrams AS (
      SELECT
        DISTINCT tri_a
      FROM
        unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
    ),
    b_trigrams AS (
      SELECT
        DISTINCT tri_b
      FROM
        unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
    )
    SELECT
      COUNTIF(tri_b IS NOT NULL) / COUNT(*)
    FROM
      a_trigrams
      LEFT JOIN b_trigrams ON tri_a = tri_b
  )
);

Here is a comparison with Postgres's pg_trgm:

select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727

select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
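
The gap between the two columns seems to come from how the trigrams are built and counted. The sketch below mirrors the distinct-trigram logic of the SQL function and, next to it, what I understand pg_trgm to do: pad each word with two leading spaces and one trailing space, then divide the shared trigrams by the union of both sets. Both function names are mine, the second one assumes single-word alphanumeric input, and returning null when the first string has no trigrams is a choice of this sketch.

```javascript
// Distinct 3-character windows of an array of characters.
function trigrams(chars) {
  const out = new Set();
  for (let i = 0; i + 3 <= chars.length; i++) {
    out.add(chars.slice(i, i + 3).join(''));
  }
  return out;
}

// Mirrors the SQL above: share of a's distinct trigrams that also occur in b.
function trigramSimilarity(a, b) {
  const ta = trigrams(Array.from(a.toLowerCase()));
  const tb = trigrams(Array.from(b.toLowerCase()));
  if (ta.size === 0) return null; // fewer than 3 characters in a
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / ta.size;
}

// pg_trgm-style score (assumed behaviour): pad the word, then
// shared trigrams divided by the size of the union.
function pgTrgmStyleSimilarity(a, b) {
  const pad = (s) => Array.from('  ' + s.toLowerCase() + ' ');
  const ta = trigrams(pad(a));
  const tb = trigrams(pad(b));
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

console.log(trigramSimilarity('saemus', 'seamus'), pgTrgmStyleSimilarity('saemus', 'seamus'));
console.log(trigramSimilarity('shamus', 'seamus'), pgTrgmStyleSimilarity('shamus', 'seamus'));
```

For 'saemus' against 'seamus' the unpadded version shares only 'mus' out of four trigrams (0.25), while the padded version shares three of eleven (about 0.2727), which lines up with the figures quoted above. Note that the SQL function is not symmetric, since it divides by the trigram count of the first argument only.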

I gave this same answer elsewhere as well.