如何检查具有自定义容差级别的字符串中是否出现了相似的子字符串
How to check if a similar sub-string has appeared in string with customise tolerance level
如何检查一个替换是否在具有特定编辑距离公差的字符串内。例如:
str = 'Python is a multi-paradigm, dynamically typed, multipurpose programming language, designed to be quick (to learn, to use, and to understand), and to enforce a clean and uniform syntax.'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1
substr_in_str(str, substr1, edit_distance_tolerance)
>> True
substr_in_str(str, substr2, edit_distance_tolerance)
>> False
substr_in_str(str, substr3, edit_distance_tolerance)
>> True
我尝试了什么:
我试着用单词打断字符串并删除特殊字符,然后一个一个地进行比较,但是性能(在速度和准确性方面)不太好。
答案并不像您想象的那么简单,您需要大量的数学知识才能实现这一点,而标准的 re(regex) 库无法解决这个问题。我认为 TRE 库已经在很大程度上解决了这个问题,请参阅此处 https://github.com/laurikari/tre/
这是我想出的递归解决方案,希望它是正确的:
def substr_in_str_word(string, substr, edit_distance_tolerance):
if edit_distance_tolerance<0:
return False
if len(substr) == 0:
return True
if len(string) == 0:
return False
for s1 in string:
for s2 in substr:
if s1==s2:
return substr_in_str(string[1:],substr[1:], edit_distance_tolerance)
else:
return substr_in_str(string[1:],substr[1:], edit_distance_tolerance-1) or \
substr_in_str(string[1:],substr[1:], edit_distance_tolerance-1) or\
substr_in_str(string[1:],substr, edit_distance_tolerance-1) or \
substr_in_str(string,substr[1:], edit_distance_tolerance-1)
def substr_in_str(string, substr, edit_distance_tolerance):
for word in string.split(' '):
if substr_in_str_word(word, substr, edit_distance_tolerance):
return True
return False
测试:
str = 'Python is a multi-paradigm'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1
print(substr_in_str(str, substr1, edit_distance_tolerance))
print(substr_in_str(str, substr2, edit_distance_tolerance))
print(substr_in_str(str, substr3, edit_distance_tolerance))
输出:
True
False
True
如何检查一个替换是否在具有特定编辑距离公差的字符串内。例如:
str = 'Python is a multi-paradigm, dynamically typed, multipurpose programming language, designed to be quick (to learn, to use, and to understand), and to enforce a clean and uniform syntax.'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1
substr_in_str(str, substr1, edit_distance_tolerance)
>> True
substr_in_str(str, substr2, edit_distance_tolerance)
>> False
substr_in_str(str, substr3, edit_distance_tolerance)
>> True
我尝试了什么: 我试着用单词打断字符串并删除特殊字符,然后一个一个地进行比较,但是性能(在速度和准确性方面)不太好。
答案并不像您想象的那么简单,您需要大量的数学知识才能实现这一点,而标准的 re(regex) 库无法解决这个问题。我认为 TRE 库已经在很大程度上解决了这个问题,请参阅此处 https://github.com/laurikari/tre/
这是我想出的递归解决方案,希望它是正确的:
def substr_in_str_word(string, substr, edit_distance_tolerance):
if edit_distance_tolerance<0:
return False
if len(substr) == 0:
return True
if len(string) == 0:
return False
for s1 in string:
for s2 in substr:
if s1==s2:
return substr_in_str(string[1:],substr[1:], edit_distance_tolerance)
else:
return substr_in_str(string[1:],substr[1:], edit_distance_tolerance-1) or \
substr_in_str(string[1:],substr[1:], edit_distance_tolerance-1) or\
substr_in_str(string[1:],substr, edit_distance_tolerance-1) or \
substr_in_str(string,substr[1:], edit_distance_tolerance-1)
def substr_in_str(string, substr, edit_distance_tolerance):
for word in string.split(' '):
if substr_in_str_word(word, substr, edit_distance_tolerance):
return True
return False
测试:
str = 'Python is a multi-paradigm'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1
print(substr_in_str(str, substr1, edit_distance_tolerance))
print(substr_in_str(str, substr2, edit_distance_tolerance))
print(substr_in_str(str, substr3, edit_distance_tolerance))
输出:
True
False
True