Discriminating between edit distances
Levenshtein edit distance only counts how many edits were made, not what those edits were, so the following two pairs have the same edit distance.
("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class A")
("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class B")
Is there any algorithm or library that can distinguish between these two pairs?
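To make the premise concrete, here is a minimal pure-Python sketch of the standard dynamic-programming Levenshtein distance. The function name `levenshtein` is my own and not taken from any library, and unit costs for insertion, deletion, and substitution are assumed. With those assumptions both pairs come out at the same value.

```python
def levenshtein(s, t):
    """Unit-cost edit distance computed with two rolling rows."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

a = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class A")
b = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class B")
print(levenshtein(*a), levenshtein(*b))
```

The count is identical because in both cases the trailing "A" of the short name lines up with the "A" of "A/S", and everything after it is inserted; whether the last inserted character is "A" or "B" makes no difference to the total.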
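You can use the jellyfish library to compute several different text similarity measures. Note that the session below joins each pair into a single string and then compares the two joined strings with each other.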
In [85]: a = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class A")
...: b = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class B")
In [86]: import jellyfish
In [87]: jellyfish.levenshtein_distance(" ".join(a), " ".join(b))
Out[87]: 1
In [88]: jellyfish.jaro_distance(" ".join(a), " ".join(b))
Out[88]: 0.9876543209876543
In [89]: jellyfish.hamming_distance(" ".join(a), " ".join(b))
Out[89]: 1
In [90]: jellyfish.jaro_winkler_similarity(" ".join(a), " ".join(b))
Out[90]: 0.9925925925925926
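If the goal is to see which edits separate the strings rather than how many there are, one option is to list the edit operations themselves. The sketch below swaps in `difflib.SequenceMatcher` from the Python standard library instead of jellyfish; the helper name `edits` is hypothetical, and difflib's alignment is not guaranteed to be a minimal Levenshtein edit script.

```python
from difflib import SequenceMatcher

def edits(s, t):
    """Return (operation, removed_text, added_text) for every non-equal opcode."""
    matcher = SequenceMatcher(None, s, t, autojunk=False)
    return [
        (tag, s[i1:i2], t[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

a = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class A")
b = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class B")
for pair in (a, b):
    print(edits(*pair))
```

The two operation lists are not identical, since the text added for the second pair contains "B" and the text added for the first pair does not. Comparing these lists (for example, by weighting an edit inside "Class A" versus "Class B" differently) is one way to tell the pairs apart even though the edit counts match.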
You can also use cosine similarity to measure how similar the texts are; it yields a different score for each of the two pairs.
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    # Dot product over the words that appear in both vectors.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    # Squared Euclidean norm of each word-count vector.
    sum1 = sum(count ** 2 for count in vec1.values())
    sum2 = sum(count ** 2 for count in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0  # at least one text contains no words
    return numerator / denominator

def text_to_vector(text):
    # Map each word to the number of times it occurs in the text.
    words = WORD.findall(text)
    return Counter(words)

x = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class A")
y = ("A P Moller - Maersk A", "A.P. Moller - Maersk A/S Class B")
cosine = get_cosine(text_to_vector(x[0]), text_to_vector(x[1]))
print("Cosine1:", cosine)
cosine1 = get_cosine(text_to_vector(y[0]), text_to_vector(y[1]))
print("Cosine2:", cosine1)
Output:
Cosine1: 0.9091372900969896
Cosine2: 0.8366600265340756
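The scores differ because the vectors are word counts, and the two longer names differ in exactly one word. A quick way to check this, using a hypothetical helper `tokens` that mirrors `text_to_vector` above:

```python
import re
from collections import Counter

def tokens(text):
    """Count the \\w+ words in text (same tokenisation as text_to_vector)."""
    return Counter(re.findall(r"\w+", text))

query = tokens("A P Moller - Maersk A")
class_a = tokens("A.P. Moller - Maersk A/S Class A")
class_b = tokens("A.P. Moller - Maersk A/S Class B")

# Words that one candidate has and the other lacks.
print(class_a - class_b)
print(class_b - class_a)

# The word "A" occurs twice in the query, so an extra "A" in the candidate
# raises the dot product, while a "B" contributes nothing to it.
print(query["A"], class_a["A"], class_b["A"])
```

Because this measure works on whole words, it ignores punctuation and word order; it separates these two pairs only because the differing character happens to form its own word.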