What is an efficient way to check if the current word is close to a word in a string?

Question

Consider the following examples:

Example 1:
```
str1 = "wow...it  looks amazing"
str2 = "looks amazi"
```
As you can see, amazi is close to amazing: str2 contains a typo. I want to write a program that tells me amazi is close to amazing, and then replaces amazi in str2 with amazing.

Example 2:
```
str1 = "is looking good"
str2 = "looks goo"
```
In this case the updated str2 will be "looking good".

Example 3:
```
str1 = "you are really looking good"
str2 = "lok goo"
```
In this case str2 will be "good", because lok is not close to looking (although it would also solve my problem if the program converted lok to looking in this case).

Example 4:

str1 = "Stu is actually SEVERLY sunburnt....it hurts!!!"
str2 = "hurts!!"

The updated str2 will be "hurts!!!".

Example 5:
```
str1 = "you guys were absolutely amazing tonight, a..."
str2 = "ly amazin"
```
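The updated str2 will be "amazing"; "ly" should be removed or replaced with absolutely.

What would be the algorithm and code for this?

Maybe we could compare the characters in order and set a threshold of 0.8 (80%), so that if word2 matches 80% of the consecutive characters of word1 from str1, we replace word2 in str2 with the word from str1? Any other efficient solution in Python code would also be welcome.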

Answer 1

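In this case you can use the Jaccard coefficient. First, split the first and second strings on spaces. Then, for each word in str2, compute the Jaccard coefficient against every word in str1 and replace it with the word that has the highest coefficient.

You can use sklearn.metrics.jaccard_score, although it compares label arrays rather than raw strings, so each word would first have to be encoded (for example as a binary vector over characters).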
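As a minimal sketch of the steps above without scikit-learn, the coefficient can be computed directly on the sets of characters in each word. The helper names `jaccard` and `closest_words` are made up for this illustration, and treating a word as a set of characters (ignoring order and repeated letters) is an assumption, not something the approach requires.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of the character sets of two words."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty words are treated as identical
    return len(sa & sb) / len(sa | sb)


def closest_words(str1: str, str2: str) -> list:
    """For each word of str2, pick the word of str1 with the highest coefficient."""
    candidates = str1.split()
    if not candidates:
        return []
    return [max(candidates, key=lambda w: jaccard(word, w)) for word in str2.split()]


print(" ".join(closest_words("wow...it  looks amazing", "looks amazi")))
```

Note that this always returns some word from str1, even for a poor match, because no minimum score is applied; a threshold would be needed to drop fragments such as "ly".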

Answer 2

Like this:

str1 = "wow...it looks amazing"
str2 =  "looks amazi"
str3 = []

# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():
        if m in n:
            str3.append(n)

# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))

elif len(str3) == 1:
    print(str3[0])

Output:

looks amazing

Update, based on the conditions given by the OP:

str1 = "good..."
str2 =  "god.."
str3 = []

# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():

        # Calculating matching character in the 2 words:
        c = ''
        for i in m:
            if i in n:
                c+=i
        # If the number of matching characters is at least 50% of the length of the str1 word,
        # or the str2 word is a substring of the str1 word:
        if len(c) >= len(n) * 0.50 or m in n:
            str3.append(n)


# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))

elif len(str3) == 1:
    print(str3[0])

Answer 3

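There are many ways to solve this. This one handles all of your examples. I added a minimum-similarity filter so that only higher-quality matches are returned. That is why 'ly' is dropped in the last example: it is not close to any of the words.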

Documentation

You can install Levenshtein with pip install python-Levenshtein

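import Levenshtein

def find_match(str1, str2):
    # Minimum Jaro-Winkler similarity required to accept a match.
    min_similarity = .75
    words1 = str1.split()
    output = []
    for y in str2.split():
        # Score this word of str2 against every word of str1.
        scores = [Levenshtein.jaro_winkler(x, y) for x in words1]
        best = max(scores)
        if best >= min_similarity:
            # Keep the first str1 word that reaches the best score.
            output.append(words1[scores.index(best)])
    return output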

Results for each of the samples you provided:

find_match("is looking good", "looks goo")

['looking','good']

find_match("you are really looking good", "lok goo")

['looking','good']

find_match("Stu is actually SEVERLY sunburnt....it hurts!!!", "hurts!!")

['hurts!!!']

find_match("you guys were absolutely amazing tonight, a...", "ly amazin")

['amazing']
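If installing a third-party package is not an option, a similar filter can be sketched with the standard library's `difflib.get_close_matches`. This swaps in a different similarity measure (the `SequenceMatcher` ratio rather than Jaro-Winkler), so scores and borderline matches may differ from the results listed above. The function name `find_match_difflib` and the reuse of 0.75 as the cutoff are assumptions made for illustration.

```python
import difflib


def find_match_difflib(str1, str2, cutoff=0.75):
    """Return, for each word of str2, the closest word of str1 (if any)."""
    candidates = str1.split()
    output = []
    for word in str2.split():
        # get_close_matches returns at most n candidates whose
        # SequenceMatcher ratio against `word` is >= cutoff, best first.
        best = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if best:
            output.append(best[0])
    return output


print(find_match_difflib("you guys were absolutely amazing tonight, a...", "ly amazin"))
# prints ['amazing']
```

As with the minimum-similarity filter, a word of str2 that has no candidate above the cutoff (such as 'ly' here) is simply left out of the result.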

Answer 4

I got it working with regular expressions:

import re

def check_regex(str1, str2):
    # New list to store the updated words
    str_new = []
    for i in str2:
        # Escape the word so punctuation such as "." or "(" is matched literally
        w = re.escape(i)
        # Patterns: character class, prefix, suffix, and substring of the word
        x = ['[' + w + ']', '^' + w, w + '$', '(' + w + ')']
        for k in x:
            h = 0
            for j in str1:
                found = "".join(re.findall(k, j))
                # Accept an exact match, or a match exactly one character shorter
                # than the word (two-letter words are excluded from the latter)
                if found == i or (found in i and abs(len(found) - len(i)) == 1 and len(i) != 2):
                    str_new.append(j)
                    h = 1
                    break
            if h == 1:
                break
    return str_new

str1 = input().split()
str2 = input().split()
print(" ".join(check_regex(str1, str2)))
