How is Levenshtein Distance calculated on Simplified Chinese characters?
I have two queries:
query1: 你好世界
query2: 你好
When I run this code using the Python library Levenshtein:
from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist
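My output is 12.
My question is how this value of 12 is derived,
since the difference in terms of strokes is definitely more than 12.
According to its documentation, it supports Unicode: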
It supports both normal and Unicode strings, but can't mix them, all
arguments to a function (method) have to be of the same type (or its
subclasses).
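You need to make sure the Chinese characters are passed as Unicode strings. A plain Python 2 string literal holds encoded bytes (UTF-8 here), so the distance is counted in bytes rather than in characters: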
In [1]: from Levenshtein import distance, hamming, median
In [2]: query1 = '你好世界'
In [3]: query2 = '你好'
In [4]: print distance(query1,query2)
6
In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
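To see where the 6 and the 2 come from, here is a minimal sketch in Python 3, where str is already Unicode. The helper edit_distance is a hypothetical pure-Python stand-in written for illustration, not the Levenshtein library's own implementation. It assumes the byte strings are UTF-8 encoded, in which each of these Chinese characters occupies 3 bytes.

```python
def edit_distance(a, b):
    """Textbook dynamic-programming Levenshtein distance over two sequences.

    Works on str (compares characters) and on bytes (compares byte values).
    Illustrative helper only; not the Levenshtein library's code.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


query1 = "你好世界"
query2 = "你好"

# Compare as Unicode characters: 世 and 界 must be deleted.
char_distance = edit_distance(query1, query2)

# Compare as UTF-8 bytes: the same two characters are 3 bytes each.
byte_distance = edit_distance(query1.encode("utf-8"), query2.encode("utf-8"))

print(len(query1.encode("utf-8")), len(query2.encode("utf-8")))  # 12 6
print(char_distance)  # 2
print(byte_distance)  # 6
```

In both cases the library is counting edits on the units it is given. Neither result reflects stroke counts; Levenshtein distance only counts insertions, deletions, and substitutions of whole sequence elements.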