模糊加入 Levenshtein 距离
Fuzzy join with Levenshtein distance
我有一个 table 包含名为“potential_users”的用户名(约 1000 行),另一个名为“actual_users”(约 1000 万行)。所有记录全部由[a-z]字符组成,没有白色space。此外,我知道 potential_users 中的 none 在 actual_users table 中。
我希望能够根据 Levenshtein 距离为 potential_users 中的每一行计算出 actual_users 中最接近的记录。例如:
| potential_users|
|----------------|
| user1 |
| kajd |
| bbbbb |
和
| actual_users |
|--------------|
| kaj |
| bbbbbbb |
| user |
会 return:
| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1 | user | 1 |
| kajd | kaj | 1 |
| bbbbb | bbbbbbb | 2 |
如果 table 很短,我可以创建一个交叉连接,计算 potential_users 中的每条记录的 Levenshtein 距离 actual_users,然后是 return具有最低值的那个。但是,在我的例子中,这将创建一个 1 000 x 10 000 000 行的中介 table,这有点不切实际。
是否有更简洁的方法来创建交叉连接来执行此类操作?
我认为你不能用简单的连接来做到这一点,有完整的算法来计算它。看看这篇文章显示 Levenshtein 距离算法在 sql:
中的实现
https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=1
不幸的是,没有交叉连接就无法做到这一点。归根结底,每个潜在用户都需要针对每个实际用户进行测试。
然而,Trino (formerly known as Presto SQL) 将在许多线程和机器上并行执行连接,因此在足够的硬件条件下它可以非常快速地执行。请注意,在 Trino 中,中间结果从一个运算符流式传输到另一个运算符,因此对于此查询,没有包含 10M x 1k 行的“中间 table”。
对于像
这样的查询
SELECT potential, min_by(actual, distance), min(distance)
FROM (
SELECT *, levenshtein_distance(potential, actual) distance
FROM actual_users, potential_users
)
GROUP BY potential
这是查询计划:
Query Plan
----------------------------------------------------------------------------------------------------------------
Fragment 0 [SINGLE]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[potential, _col1, _col2]
│ Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ _col1 := min_by
│ _col2 := min
└─ RemoteSource[1]
Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
Fragment 1 [HASH]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(FINAL)[potential]
│ Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ min := min("min_1")
│ min_by := min_by("min_by_0")
└─ LocalExchange[HASH] ("potential")
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[2]
Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
Fragment 2 [SOURCE]
Output layout: [potential, min_1, min_by_0]
Output partitioning: HASH [potential]
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
├─ TableScan[memory:9, grouped = false]
│ Layout: [actual:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
│ actual := 0
└─ LocalExchange[SINGLE] ()
│ Layout: [potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[3]
Layout: [potential:varchar(5)]
Fragment 3 [SOURCE]
Output layout: [potential]
Output partitioning: BROADCAST []
Stage Execution Strategy: UNGROUPED_EXECUTION
TableScan[memory:8, grouped = false]
Layout: [potential:varchar(5)]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
potential := 0
(1 row)
特别是,对于这一部分,一旦交叉连接产生一行,它就会被送入计算两个值之间的 Levenshtein 距离的投影运算符,然后送入只存储一个值的聚合每个“潜在”用户的组。因此,此查询所需的内存量应该很小。
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
我有一个 table 包含名为“potential_users”的用户名(约 1000 行),另一个名为“actual_users”(约 1000 万行)。所有记录全部由[a-z]字符组成,没有白色space。此外,我知道 potential_users 中的 none 在 actual_users table 中。
我希望能够根据 Levenshtein 距离为 potential_users 中的每一行计算出 actual_users 中最接近的记录。例如:
| potential_users|
|----------------|
| user1 |
| kajd |
| bbbbb |
和
| actual_users |
|--------------|
| kaj |
| bbbbbbb |
| user |
会 return:
| potential_users | actual_users | levenshtein_distance |
|-----------------|--------------|----------------------|
| user1 | user | 1 |
| kajd | kaj | 1 |
| bbbbb | bbbbbbb | 2 |
如果 table 很短,我可以创建一个交叉连接,计算 potential_users 中的每条记录的 Levenshtein 距离 actual_users,然后是 return具有最低值的那个。但是,在我的例子中,这将创建一个 1 000 x 10 000 000 行的中介 table,这有点不切实际。
是否有更简洁的方法来创建交叉连接来执行此类操作?
我认为你不能用简单的连接来做到这一点,有完整的算法来计算它。看看这篇文章显示 Levenshtein 距离算法在 sql:
中的实现https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=1
不幸的是,没有交叉连接就无法做到这一点。归根结底,每个潜在用户都需要针对每个实际用户进行测试。
然而,Trino (formerly known as Presto SQL) 将在许多线程和机器上并行执行连接,因此在足够的硬件条件下它可以非常快速地执行。请注意,在 Trino 中,中间结果从一个运算符流式传输到另一个运算符,因此对于此查询,没有包含 10M x 1k 行的“中间 table”。
对于像
这样的查询SELECT potential, min_by(actual, distance), min(distance)
FROM (
SELECT *, levenshtein_distance(potential, actual) distance
FROM actual_users, potential_users
)
GROUP BY potential
这是查询计划:
Query Plan
----------------------------------------------------------------------------------------------------------------
Fragment 0 [SINGLE]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Output[potential, _col1, _col2]
│ Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ _col1 := min_by
│ _col2 := min
└─ RemoteSource[1]
Layout: [potential:varchar(5), min_by:varchar(7), min:bigint]
Fragment 1 [HASH]
Output layout: [potential, min_by, min]
Output partitioning: SINGLE []
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(FINAL)[potential]
│ Layout: [potential:varchar(5), min:bigint, min_by:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ min := min("min_1")
│ min_by := min_by("min_by_0")
└─ LocalExchange[HASH] ("potential")
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
└─ RemoteSource[2]
Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
Fragment 2 [SOURCE]
Output layout: [potential, min_1, min_by_0]
Output partitioning: HASH [potential]
Stage Execution Strategy: UNGROUPED_EXECUTION
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED
├─ TableScan[memory:9, grouped = false]
│ Layout: [actual:varchar(7)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
│ actual := 0
└─ LocalExchange[SINGLE] ()
│ Layout: [potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: ?}
└─ RemoteSource[3]
Layout: [potential:varchar(5)]
Fragment 3 [SOURCE]
Output layout: [potential]
Output partitioning: BROADCAST []
Stage Execution Strategy: UNGROUPED_EXECUTION
TableScan[memory:8, grouped = false]
Layout: [potential:varchar(5)]
Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}
potential := 0
(1 row)
特别是,对于这一部分,一旦交叉连接产生一行,它就会被送入计算两个值之间的 Levenshtein 距离的投影运算符,然后送入只存储一个值的聚合每个“潜在”用户的组。因此,此查询所需的内存量应该很小。
Aggregate(PARTIAL)[potential]
│ Layout: [potential:varchar(5), min_1:bigint, min_by_0:row(boolean, boolean, bigint, varchar(7))]
│ min_1 := min("levenshtein_distance")
│ min_by_0 := min_by("actual", "levenshtein_distance")
└─ Project[]
│ Layout: [actual:varchar(7), potential:varchar(5), levenshtein_distance:bigint]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ levenshtein_distance := levenshtein_distance("potential", "actual")
└─ CrossJoin
│ Layout: [actual:varchar(7), potential:varchar(5)]
│ Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
│ Distribution: REPLICATED