What is an efficient way to check whether the current word is close to a word in a string?

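Consider the following examples:

  1. Example 1:

    str1 = "wow...it  looks amazing"
    str2 = "looks amazi"

    You can see that amazi is close to amazing; str2 was mistyped. I want to write a program that tells me amazi is close to amazing, and then replaces amazi with amazing in str2.

  2. Example 2:

    str1 = "is looking good"
    str2 = "looks goo"

    In this case the updated str2 will be "looking good".

  3. Example 3:

    str1 = "you are really looking good"
    str2 = "lok goo"

    In this case str2 will be "good", because lok is not close to looking (although if the program converted lok to looking here, that would also be fine for my problem).

  4. Example 4:

    str1 = "Stu is actually SEVERLY sunburnt....it hurts!!!"
    str2 = "hurts!!"

    The updated str2 will be "hurts!!!".

  5. Example 5:

    str1 = "you guys were absolutely amazing tonight, a..."
    str2 = "ly amazin"

    The updated str2 will be "amazing"; "ly" should either be removed or replaced with absolutely.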

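What would the algorithm and code be?

Maybe we could look at the characters in order and set a threshold of 0.8 (80%): if word2 shares 80% of its consecutive characters with a word1 from str1, we replace word2 in str2 with that word from str1. Or is there any other efficient solution, with Python code please?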
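The threshold idea above could be sketched roughly as follows. This is only one possible reading of it: the longest run of consecutive characters shared by the two words is compared with the length of the mistyped word, not the length of the str1 word. The helper names `longest_common_run` and `correct` are hypothetical, and the standard-library `difflib` module is used here as an assumption, not something the question mentions.

```python
from difflib import SequenceMatcher


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def correct(str1: str, str2: str, threshold: float = 0.8) -> str:
    """Hypothetical helper: replace each word of str2 by its closest word in str1."""
    candidates = str1.split()
    kept = []
    for token in str2.split():
        # Candidate sharing the longest consecutive run with this token
        best = max(candidates, key=lambda w: longest_common_run(token, w))
        # Keep it only if that run covers at least `threshold` of the token
        if longest_common_run(token, best) / len(token) >= threshold:
            kept.append(best)
    return " ".join(kept)


print(correct("you are really looking good", "lok goo"))
```

Under this reading, "lok" is dropped because it shares only two consecutive characters with "looking", while "ly" in the last example is replaced by "absolutely", which the question lists as an acceptable outcome.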

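In this case you can use the Jaccard coefficient. First, split both the first and the second string on spaces. Then, for each word in str2, compute the Jaccard coefficient against every word in str1 and replace it with the word that has the highest coefficient.

You can use sklearn.metrics.jaccard_score. Note that it compares label arrays rather than raw strings, so each word would first have to be encoded, for example as a binary vector marking which characters it contains.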
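A minimal pure-Python sketch of the steps described above, assuming the Jaccard coefficient is taken over the set of characters in each word. It does not use sklearn, and the function names `jaccard` and `replace_with_closest` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of the character sets of two words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def replace_with_closest(str1: str, str2: str) -> str:
    """Hypothetical helper: map every word of str2 to its best match in str1."""
    candidates = str1.split()
    return " ".join(
        # Pick the str1 word with the highest Jaccard coefficient
        max(candidates, key=lambda w: jaccard(w, token))
        for token in str2.split()
    )


print(replace_with_closest("is looking good", "looks goo"))
```

Because character sets ignore order and repetition, and no minimum score is applied here, every word of str2 is always replaced by something; a threshold would be needed to drop words such as "ly".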

Something like this:

str1 = "wow...it looks amazing"
str2 = "looks amazi"
str3 = []

# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():
        if m in n:
            str3.append(n)

# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))

elif len(str3) == 1:
    print(str3[0])

Output:

looks amazing

Update, based on the conditions the OP gave:

str1 = "good..."
str2 = "god.."
str3 = []

# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():

        # Counting the characters of the str2 word that also occur in the str1 word:
        c = sum(1 for i in m if i in n)
        # If the number of matching characters is at least 50% of the length of the str1 word,
        # or the str2 word is contained in the str1 word:
        if c >= len(n) * 0.50 or m in n:
            str3.append(n)


# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))

elif len(str3) == 1:
    print(str3[0])

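There are many ways to approach this. This one handles all of your examples. I added a minimum-similarity filter so that only higher-quality matches are returned. That is what lets 'ly' be dropped in the last example, since it is not close to any of the words.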

Documentation

You can install Levenshtein with pip install python-Levenshtein.

import Levenshtein

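def find_match(str1, str2):
    min_similarity = 0.75
    words = str1.split()
    output = []
    for token in str2.split():
        # Jaro-Winkler similarity of this str2 word against every str1 word
        scores = [Levenshtein.jaro_winkler(word, token) for word in words]
        best = max(scores)
        # Keep the best-scoring str1 word only if it clears the threshold
        if best >= min_similarity:
            output.append(words[scores.index(best)])
    return output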

Here is each of the samples you provided.

find_match("is looking good", "looks goo")

['looking','good']

find_match("you are really looking good", "lok goo")

['looking','good']

find_match("Stu is actually SEVERLY sunburnt....it hurts!!!", "hurts!!")

['hurts!!!']

find_match("you guys were absolutely amazing tonight, a...", "ly amazin")

['amazing']

I solved it with regular expressions.

import re

def check_regex(str1, str2):
    # New list to store the updated words
    str_new = []
    for i in str2:
        # Regular expressions for comparing the words: any character of i,
        # i as a prefix, i as a suffix, and i anywhere in the word.
        # Note: i is inserted unescaped, so regex metacharacters in a word
        # are interpreted as regex syntax.
        x = ['[' + i + ']', '^' + i, i + '$', '(' + i + ')']
        for k in x:
            h = 0
            for j in str1:
                found = "".join(re.findall(k, j))
                # Conditions to make sure the word is close enough to the particular word
                if found == i or (found in i and abs(len(found) - len(i)) == 1 and len(i) != 2):
                    str_new.append(j)
                    h = 1
                    break
            if h == 1:
                break
    return str_new

str1 = input().split()
str2 = input().split()
print(" ".join(check_regex(str1, str2)))