如何检查具有自定义容差级别的字符串中是否出现了相似的子字符串

How to check if a similar sub-string has appeared in string with customise tolerance level

如何检查一个替换是否在具有特定编辑距离公差的字符串内。例如:

str = 'Python is a multi-paradigm, dynamically typed, multipurpose programming language, designed to be quick (to learn, to use, and to understand), and to enforce a clean and uniform syntax.'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1

substr_in_str(str, substr1, edit_distance_tolerance)
>> True

substr_in_str(str, substr2, edit_distance_tolerance)
>> False

substr_in_str(str, substr3, edit_distance_tolerance)
>> True

我尝试了什么: 我试着用单词打断字符串并删除特殊字符,然后一个一个地进行比较,但是性能(在速度和准确性方面)不太好。

答案并不像您想象的那么简单,您需要大量的数学知识才能实现这一点,而标准的 re(regex) 库无法解决这个问题。我认为 TRE 库已经在很大程度上解决了这个问题,请参阅此处 https://github.com/laurikari/tre/

这是我想出的递归解决方案,希望它是正确的:

def substr_in_str_word(string, substr, edit_distance_tolerance):

    if edit_distance_tolerance<0:
        return False

    if len(substr) == 0:
        return True

    if len(string) == 0:
        return False

    for s1 in string:
        for s2 in substr:
            if s1==s2:
                return substr_in_str(string[1:],substr[1:], edit_distance_tolerance)
            else:
                return substr_in_str(string[1:],substr[1:], edit_distance_tolerance-1) or \
            substr_in_str(string[1:],substr[1:], edit_distance_tolerance-1) or\
            substr_in_str(string[1:],substr, edit_distance_tolerance-1) or \
            substr_in_str(string,substr[1:], edit_distance_tolerance-1)


def substr_in_str(string, substr, edit_distance_tolerance):
    for word in string.split(' '):
        if substr_in_str_word(word, substr, edit_distance_tolerance):
            return True
    return False          

测试:

str = 'Python is a multi-paradigm'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'

edit_distance_tolerance = 1

print(substr_in_str(str, substr1, edit_distance_tolerance))
print(substr_in_str(str, substr2, edit_distance_tolerance))
print(substr_in_str(str, substr3, edit_distance_tolerance))

输出:

True
False
True