How to compare similarity between two strings (other than English) in Python

Question

I want to find the similarity between two strings, for example:

string1 = "One"
string2 = "one"

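I want the answer to be between 0 and 1. For the two strings above, I expect 1. Right now I am using "Jellyfish", a Python module that provides the jaro_distance() function. The drawback is that I can only compare strings containing English words and other special characters. I want to compare two strings in another language, such as Punjabi: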

string1 = "ਬੁੱਧਵਾਰ"
string2 = "ਬੁੱਧਵਾ"

I tried the same jaro_distance() function, but I got:

>>> score = jellyfish.jaro_distance(unicode(string1), unicode(string2))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

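I tried encoding and decoding the strings before passing them to the function. Is there any way to use jaro_distance() with other languages, or are there other modules or functions available for this? Can you help me solve this?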

Answer 1

You can use SequenceMatcher from the built-in difflib module.

Code example:

import difflib

print(difflib.SequenceMatcher(None, "ਬੁੱਧਵਾਰ", "ਬੁੱਧਵਾ").ratio())

Output:

0.9230769230769231
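If you also want "One" and "one" to score 1, as the question asks, one option is to normalize and case-fold both strings before comparing them. The following is only a minimal sketch for Python 3: the helper name similarity is hypothetical (it is not part of difflib), and ignoring case is an assumption about what you want.

```python
import difflib
import unicodedata


def similarity(a, b):
    """Return a score between 0 and 1 (hypothetical helper, not a difflib API)."""
    # Assumption: case differences should not count, so case-fold both inputs.
    # NFC normalization is applied first so that composed and decomposed
    # forms of the same text are treated alike.
    a = unicodedata.normalize("NFC", a).casefold()
    b = unicodedata.normalize("NFC", b).casefold()
    return difflib.SequenceMatcher(None, a, b).ratio()


print(similarity("One", "one"))  # 1.0
print(similarity("ਬੁੱਧਵਾਰ", "ਬੁੱਧਵਾ"))
```

As for the UnicodeDecodeError in the question: in Python 2, calling unicode() on a byte string decodes it with the ASCII codec by default, which fails on non-ASCII bytes. Decoding explicitly, for example string1.decode("utf-8"), or using u"..." literals should avoid that error. In Python 3, str is already Unicode, so no conversion is needed.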
