如何在 Oracle 中使用模糊匹配获得准确的 JOIN
How to get an accurate JOIN using Fuzzy matching in Oracle
我正在尝试将一个 table 中的一组县名与另一个 table 中的县名合并。这里的问题是,两个 table 中的县名都没有规范化。它们的数量不一样;此外,它们可能并不总是以相似的模式出现。例如,"Table A" 中的 'SAINT JOHNS' 县可能在 "Table B" 中表示为 'ST JOHNS'。我们无法预测它们的共同模式。
也就是说,我们不能在加入时使用"equal to"(=
)条件。所以,我正在尝试使用 oracle 中的 JARO_WINKLER_SIMILARITY
函数加入他们。
我的 Left Outer Join 条件如下:
Table_A.State = Table_B.State
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80
在对结果进行一些测试后,我给了 80 分,这似乎是最佳的。
在这里,问题是我在加入时设置了 "false Positives"。例如,如果同一州下有一些名称相似的县(例如"BARRY'and "BAY"),则如果度量为>=80
,则将匹配它们。
这会创建不准确的连接数据集。
任何人都可以提出一些解决方法吗?
谢谢,
DAV
Can you plz help me to build a query that will lookup Table_A for each record in Table B/C/D, and match against the county name in A with highest ranked similarity that is >=80
Oracle 设置:
CREATE TABLE official_words ( word ) AS
SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
SELECT 'MONTGOMERY' FROM DUAL UNION ALL
SELECT 'MONROE' FROM DUAL UNION ALL
SELECT 'SAINT JAMES' FROM DUAL UNION ALL
SELECT 'BOTANY BAY' FROM DUAL;
CREATE TABLE words_to_match ( word ) AS
SELECT 'SAINT JOHN' FROM DUAL UNION ALL
SELECT 'ST JAMES' FROM DUAL UNION ALL
SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
SELECT 'MONROE ST' FROM DUAL;
查询:
SELECT *
FROM (
SELECT wtm.word,
ow.word AS official_word,
UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
FROM words_to_match wtm
INNER JOIN
official_words ow
ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;
输出:
WORD OFFICIAL_WO SIMILARITY RN
-------------- ----------- ---------- ----------
MONROE ST MONROE 93 1
MONTGOMERY BAY MONTGOMERY 94 1
SAINT JOHN SAINT JOHNS 98 1
ST JAMES SAINT JAMES 80 1
内联使用一些虚构的测试数据(您可以使用自己的 TABLE_A 和 TABLE_B 代替前两个 with
子句,并从 with matches as ...
开始):
with table_a (state, county_name) as
( select 'A', 'ST JOHNS' from dual union all
select 'A', 'BARRY' from dual union all
select 'B', 'CHEESECAKE' from dual union all
select 'B', 'WAFFLES' from dual union all
select 'C', 'UMBRELLAS' from dual )
, table_b (state, county_name) as
( select 'A', 'SAINT JOHNS' from dual union all
select 'A', 'SAINT JOANS' from dual union all
select 'A', 'BARRY' from dual union all
select 'A', 'BARRIERS' from dual union all
select 'A', 'BANANA' from dual union all
select 'A', 'BANOFFEE' from dual union all
select 'B', 'CHEESE' from dual union all
select 'B', 'CHIPS' from dual union all
select 'B', 'CHICKENS' from dual union all
select 'B', 'WAFFLING' from dual union all
select 'B', 'KITTENS' from dual union all
select 'C', 'PUPPIES' from dual union all
select 'C', 'UMBRIA' from dual union all
select 'C', 'UMBRELLAS' from dual )
, matches as
( select a.state, a.county_name, b.county_name as matched_name
, utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
from table_a a
join table_b b on b.state = a.state )
, ranked_matches as
( select m.*
, rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
from matches m
where score > 50 )
select rm.state, rm.county_name, rm. matched_name, rm.score
from ranked_matches rm
where ranking = 1
order by 1,2;
结果:
STATE COUNTY_NAME MATCHED_NAME SCORE
----- ----------- ------------ ----------
A BARRY BARRY 100
A ST JOHNS SAINT JOHNS 80
B CHEESECAKE CHEESE 92
B WAFFLES WAFFLING 86
C UMBRELLAS UMBRELLAS 100
想法是 matches
计算所有得分,ranked_matches
为它们分配一个序列(state
,county_name
),最后的查询选择所有得分最高的球员(即过滤 ranking = 1
)。
您可能仍然会得到一些重复项,因为没有什么可以阻止两个不同的模糊匹配得分相同。
我正在尝试将一个 table 中的一组县名与另一个 table 中的县名合并。这里的问题是,两个 table 中的县名都没有规范化。它们的数量不一样;此外,它们可能并不总是以相似的模式出现。例如,"Table A" 中的 'SAINT JOHNS' 县可能在 "Table B" 中表示为 'ST JOHNS'。我们无法预测它们的共同模式。
也就是说,我们不能在加入时使用"equal to"(=
)条件。所以,我正在尝试使用 oracle 中的 JARO_WINKLER_SIMILARITY
函数加入他们。
我的 Left Outer Join 条件如下:
Table_A.State = Table_B.State
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80
在对结果进行一些测试后,我给了 80 分,这似乎是最佳的。
在这里,问题是我在加入时设置了 "false Positives"。例如,如果同一州下有一些名称相似的县(例如"BARRY'and "BAY"),则如果度量为>=80
,则将匹配它们。
这会创建不准确的连接数据集。
任何人都可以提出一些解决方法吗?
谢谢, DAV
Can you plz help me to build a query that will lookup Table_A for each record in Table B/C/D, and match against the county name in A with highest ranked similarity that is >=80
Oracle 设置:
CREATE TABLE official_words ( word ) AS
SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
SELECT 'MONTGOMERY' FROM DUAL UNION ALL
SELECT 'MONROE' FROM DUAL UNION ALL
SELECT 'SAINT JAMES' FROM DUAL UNION ALL
SELECT 'BOTANY BAY' FROM DUAL;
CREATE TABLE words_to_match ( word ) AS
SELECT 'SAINT JOHN' FROM DUAL UNION ALL
SELECT 'ST JAMES' FROM DUAL UNION ALL
SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
SELECT 'MONROE ST' FROM DUAL;
查询:
SELECT *
FROM (
SELECT wtm.word,
ow.word AS official_word,
UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
FROM words_to_match wtm
INNER JOIN
official_words ow
ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;
输出:
WORD OFFICIAL_WO SIMILARITY RN
-------------- ----------- ---------- ----------
MONROE ST MONROE 93 1
MONTGOMERY BAY MONTGOMERY 94 1
SAINT JOHN SAINT JOHNS 98 1
ST JAMES SAINT JAMES 80 1
内联使用一些虚构的测试数据(您可以使用自己的 TABLE_A 和 TABLE_B 代替前两个 with
子句,并从 with matches as ...
开始):
with table_a (state, county_name) as
( select 'A', 'ST JOHNS' from dual union all
select 'A', 'BARRY' from dual union all
select 'B', 'CHEESECAKE' from dual union all
select 'B', 'WAFFLES' from dual union all
select 'C', 'UMBRELLAS' from dual )
, table_b (state, county_name) as
( select 'A', 'SAINT JOHNS' from dual union all
select 'A', 'SAINT JOANS' from dual union all
select 'A', 'BARRY' from dual union all
select 'A', 'BARRIERS' from dual union all
select 'A', 'BANANA' from dual union all
select 'A', 'BANOFFEE' from dual union all
select 'B', 'CHEESE' from dual union all
select 'B', 'CHIPS' from dual union all
select 'B', 'CHICKENS' from dual union all
select 'B', 'WAFFLING' from dual union all
select 'B', 'KITTENS' from dual union all
select 'C', 'PUPPIES' from dual union all
select 'C', 'UMBRIA' from dual union all
select 'C', 'UMBRELLAS' from dual )
, matches as
( select a.state, a.county_name, b.county_name as matched_name
, utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
from table_a a
join table_b b on b.state = a.state )
, ranked_matches as
( select m.*
, rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
from matches m
where score > 50 )
select rm.state, rm.county_name, rm. matched_name, rm.score
from ranked_matches rm
where ranking = 1
order by 1,2;
结果:
STATE COUNTY_NAME MATCHED_NAME SCORE
----- ----------- ------------ ----------
A BARRY BARRY 100
A ST JOHNS SAINT JOHNS 80
B CHEESECAKE CHEESE 92
B WAFFLES WAFFLING 86
C UMBRELLAS UMBRELLAS 100
想法是 matches
计算所有得分,ranked_matches
为它们分配一个序列(state
,county_name
),最后的查询选择所有得分最高的球员(即过滤 ranking = 1
)。
您可能仍然会得到一些重复项,因为没有什么可以阻止两个不同的模糊匹配得分相同。