Is this already a string similarity algorithm?

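Other than Levenshtein distance, I'm not familiar with string similarity algorithms. That's the one I'm using, and it's turning out to be less than ideal.

So I have an idea for a recursive algorithm I'd like to implement, but I want to know whether it already exists so I can leverage other people's expertise.

An example of the algorithm: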

string 1: "Paul Johnson"

string 2: "John Paulson"

Step 1: Find all of the longest matches

Match 1: "Paul"

Match 2: "John"

Match 3: "son"

Match 4: " "

Step 2: Calculate a score for each match using the following formula: ((match.len/string.len)*match.len), where string.len is the length of the string.

Match 1: (4/12)*4 = 1.333...

Match 2: 1.333...

Match 3: .75

Match 4: .083

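Step 3: Perform steps 1 and 2 at a larger scale (matching the matches). I haven't fully figured this part out yet. My thinking is that if "son" comes after "Paul John" in one string and after "John Paul" in the other, that should count for something.

Step 4: Sum all of the computed scores.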

Scores: 1.333 + 1.333 + .75 + .083333 = 3.4999... (plus whatever scores step 3 produces)
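
Steps 1, 2 and 4 can be sketched roughly as follows. This is only one possible reading of the description, not a definitive implementation: the function names `longest_matches` and `score` are made up for illustration, "string.len" is assumed to mean the length of the longer input, and matched characters are blanked out with `"\0"` and `"\1"` (assumed not to occur in the inputs) so they cannot be matched twice. Step 3 is left out because it is not specified yet.

```python
from difflib import SequenceMatcher


def longest_matches(s1, s2):
    """Greedily collect the longest common substrings of s1 and s2."""
    # Assumption: the inputs never contain "\0" or "\1". Consumed characters
    # are overwritten with these so they cannot take part in a later match.
    a, b = list(s1), list(s2)
    found = []
    while True:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size == 0:
            return found
        found.append("".join(a[m.a:m.a + m.size]))
        a[m.a:m.a + m.size] = ["\0"] * m.size
        b[m.b:m.b + m.size] = ["\1"] * m.size


def score(s1, s2):
    """Sum (match.len / string.len) * match.len over all matches (steps 2 and 4)."""
    # Assumption: "string.len" is the length of the longer input string.
    n = max(len(s1), len(s2))
    return sum(len(m) / n * len(m) for m in longest_matches(s1, s2))


print(longest_matches("Paul Johnson", "John Paulson"))  # ['Paul', 'John', 'son', ' ']
print(round(score("Paul Johnson", "John Paulson"), 3))  # 3.5
```

Blanking out matched characters instead of deleting them keeps the remaining fragments from joining up, which is why "son" and " " are reported as separate matches here.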

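Does this look familiar to anyone? I'm hoping someone else has already built an algorithm along these lines so I don't have to figure it out myself.

What you describe somewhat resembles what the following paper calls Longest Common Sub-string (LCS). A brief description and a comparison with other algorithms: A Comparison of Personal Name Matching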

This algorithm [11] repeatedly finds and removes the longest common sub-string in the two strings compared, up to a minimum lengths (normally set to 2 or 3).

...
A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings (similar to Smith-Waterman).

...

this algorithm is suitable for compound names that have words (like given- and surname) swapped.
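
A minimal sketch of the measure as the quoted passage describes it, not the paper's reference implementation: the name `lcs_similarity` and its parameters `min_len` and `norm` are assumptions made for illustration. It repeatedly removes the longest common substring from both strings until the best remaining match is shorter than `min_len`, then divides the total matched length by a chosen normaliser (maximum length by default).

```python
from difflib import SequenceMatcher


def lcs_similarity(s1, s2, min_len=2, norm=max):
    """Total length of repeatedly removed common substrings, normalised.

    norm receives both original lengths; pass min, max, or an averaging
    function such as lambda x, y: (x + y) / 2.
    """
    a, b = s1, s2
    total = 0
    while True:
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        # Stop once the best remaining match is too short (or empty).
        if m.size < max(min_len, 1):
            break
        total += m.size
        # Actually delete the match, so neighbouring fragments join up.
        a = a[:m.a] + a[m.a + m.size:]
        b = b[:m.b] + b[m.b + m.size:]
    denom = norm(len(s1), len(s2))
    return total / denom if denom else 0.0


print(lcs_similarity("Paul Johnson", "John Paulson"))  # 1.0
```

For the example names, removing "Paul" and then "John" leaves " son" in both strings, which matches as a single piece, so the swapped name scores 1.0.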
