Is there a way to measure string similarity in Google BigQuery?
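I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
My case is that I need to compare the similarity of two URLs, because I want to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I have not used UDFs at all (or JavaScript, for that matter :)).
I'm just wondering whether the existing regex functions could do this, or whether anyone could help me get started with porting the JavaScript example into a UDF.
Any help is much appreciated, thanks.
EDIT: Adding some example code.
So if I have a UDF defined as: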
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
I try to test it using the following query:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
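I get the error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39
Error location: User-defined function
It seems that row.inputA may not be a string, or that for some reason the string functions cannot work on it. I'm not sure whether this is a type issue or something odd about which utilities a UDF can use by default.
Again, any help is much appreciated, thanks.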
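Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by simply calculating abs(strlen - distance) / strlen.
The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.
To invoke it, you would pass in the two URLs as columns aliased to 'a' and 'b':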
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
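As a starting point for the UDF body, here is a minimal sketch of the distance calculation in plain JavaScript, using the usual iterative two-row formulation. It takes two strings rather than a row object; the recursive calls in the question pass plain strings where the function expects a row, which is a likely source of the "'substr' of undefined" error. The similarity helper and its choice of the longer string's length as the denominator are assumptions for illustration, not part of the answer above.

```javascript
// Iterative Levenshtein distance between two strings (two-row dynamic programming).
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // delete a character from a
        curr[j - 1] + 1,    // insert a character into a
        prev[j - 1] + cost  // substitute (or keep) a character
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper (an assumption): scale the distance to a 0..1 similarity
// using the length of the longer string.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : (longest - levenshtein(a, b)) / longest;
}
```

With the question's test input, levenshtein('abc', 'abd') gives 1. The two functions could presumably be pasted into the body of a CREATE TEMP FUNCTION ... LANGUAGE js definition that returns similarity(a, b) as a FLOAT64.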
I couldn't find a direct answer to this question, so I propose this solution in standard SQL:
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
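Compared to other solutions, it takes two strings (of the same length, following the definition of Hamming distance) and outputs the expected distance.
Below is a very simple version of Hamming distance that uses WITH OFFSET instead of ROW_NUMBER() OVER():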
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input
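For comparison, the same position-by-position count can be sketched as a plain JavaScript function, which could serve as the body of a JS UDF. One difference worth noting: both SQL versions join on the character position, so positions that exist in only one string drop out of the join, whereas this sketch rejects inputs of unequal length. The function name and the choice to throw are assumptions for illustration.

```javascript
// Hamming distance: number of positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    // Assumption: fail loudly instead of silently comparing only the common prefix.
    throw new RangeError('Hamming distance requires strings of equal length');
  }
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) diff++;
  }
  return diff;
}
```

For the sample inputs above and the target 'abcdef', this gives 0, 3, 1, 2, 4 and 3 respectively.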
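If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery by loading an external library from GCS.
Steps:
- Download the JavaScript version of fuzzywuzzy (fuzzball)
- Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (like fuzzball)
- Upload it to a Google Cloud Storage bucket
- Create a temp function to use the library in your query (set the path in OPTIONS to the relevant path)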
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data
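Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance, and it calculates percentage similarities right out of the box.
One Flookup function you might find useful is:
FUZZYMATCH (string1, string2)
Parameter details
- string1: compares to string2.
- string2: compares to string1.
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I'm currently trying to optimise it for large data sets, so your feedback is very welcome.
Edit: I'm the creator of Flookup.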
Ready-to-use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
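While looking for this, I found Felipe's answer above; I then worked on my own query and ended up with two versions, one I called string approximation and the other string resemblance.
The first looks at the shortest distance between the letters of the source string and the test string, and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.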
#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref,
case
when min(result) is null then 0
else 1 / (min(result) + 1)
end as best_result,
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)),
greatest(length(testString),length(sourceString))) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
)
);
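The second is a variation of the first that looks at sequences of matching distances, so that a matching character at the same distance as the character before or after it counts as one point. This works quite well, better than string approximation, but not quite as well as I would like (see the example output below).
#standardSQL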
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
select ref,
if(array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.before))) > 0
or array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.after))) > 0
, 1, 0) as sequence
from (
select ref,
collection,
lag(collection) over (order by ref) as before,
lead(collection) over (order by ref) as after
from (
select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
array_agg(result ignore nulls) as collection
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)), null) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
) as comparison
)
)
);
Below is a sample of the results:
#standardSQL
with test_subjects as (
select 'benji' as name union all
select 'benjamin' union all
select 'benjamin alan artis' union all
select 'ben artis' union all
select 'artis benjamin'
)
select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects
order by resemblance desc
This returns:
+---------------------+---------------------+---------------------+
| name                | approximation       | resemblance         |
+---------------------+---------------------+---------------------+
| artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| ben artis           | 0.4142857142857142  | 0.7142857142857143  |
+---------------------+---------------------+---------------------+
| benjamin            | 0.6125850340136053  | 0.5714285714285714  |
+---------------------+---------------------+---------------------+
| benji               | 0.36269841269841263 | 0.28571428571428575 |
+---------------------+---------------------+---------------------+
Edited: updated the resemblance algorithm to improve the results.
I did it like this:
CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
(
WITH a_trigrams AS (
SELECT
DISTINCT tri_a
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
),
b_trigrams AS (
SELECT
DISTINCT tri_b
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
)
SELECT
COUNTIF(tri_b IS NOT NULL) / COUNT(*)
FROM
a_trigrams
LEFT JOIN b_trigrams ON tri_a = tri_b
)
);
Here is a comparison with Postgres's pg_trgm:
select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727
select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
I gave the same answer on another question.
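The gap between the two sets of numbers above appears to come from two differences: the SQL function divides the shared trigrams by the number of distinct trigrams in a only, while pg_trgm, as far as its documentation describes, pads each word with two leading spaces and one trailing space and divides the shared trigrams by the total number of distinct trigrams in both strings. A minimal JavaScript sketch under those assumptions is below; the function names are hypothetical, and treating only ASCII letters and digits as word characters is a simplification.

```javascript
// Build the set of distinct trigrams of a string, roughly following the padding
// convention described for pg_trgm: lowercase, split into alphanumeric words,
// then pad each word with two leading spaces and one trailing space.
function trigrams(s) {
  const set = new Set();
  // Assumption: only ASCII letters and digits count as word characters.
  const words = s.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);
  for (const w of words) {
    const padded = '  ' + w + ' ';
    for (let i = 0; i + 3 <= padded.length; i++) {
      set.add(padded.slice(i, i + 3));
    }
  }
  return set;
}

// Shared trigrams divided by all distinct trigrams of both strings.
function trigramSimilarity(a, b) {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  for (const t of ta) {
    if (tb.has(t)) shared++;
  }
  return shared / (ta.size + tb.size - shared);
}
```

For the two pairs compared above, this sketch gives 3/11 (about 0.2727) for 'saemus' and 'seamus' and 0.4 for 'shamus' and 'seamus', in line with the pg_trgm figures quoted.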