What is an efficient way to check if the current word is close to a word in a string?
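Consider the following examples:

Example 1:
str1 = "wow...it looks amazing"
str2 = "looks amazi"
You can see that amazi is close to amazing, so str2 contains a typo. I want to write a program that tells me amazi is close to amazing, and then replaces amazi with amazing in str2.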

Example 2:
str1 = "is looking good"
str2 = "looks goo"
In this case the updated str2 will be "looking good".

Example 3:
str1 = "you are really looking good"
str2 = "lok goo"
In this case str2 will be "good", because lok is not close to looking (although it would also solve my problem if the program converted lok to looking in this case).

Example 4:
str1 = "Stu is actually SEVERLY sunburnt....it hurts!!!"
str2 = "hurts!!"
The updated str2 will be "hurts!!!".

Example 5:
str1 = "you guys were absolutely amazing tonight, a..."
str2 = "ly amazin"
The updated str2 will be "amazing"; "ly" should be removed or replaced with absolutely.
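
What would the algorithm and code for this be?
Maybe we could look at the characters lexicographically and set a threshold of 0.8 (80%), so that if word2 matches 80% of the sequential characters of word1 from str1, we replace word2 in str2 with the word from str1?
Any other efficient solution in Python code, please?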
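
The threshold idea in the question can be sketched roughly as follows. This assumes that "sequential characters" means the longest contiguous run shared by the two words, measured against the length of the word from str1; the function names sequential_fraction and is_close are made up for illustration and are not from the thread.

```python
from difflib import SequenceMatcher


def sequential_fraction(word2, word1):
    """Share of word1 covered by the longest contiguous run it shares with word2."""
    if not word1:
        return 0.0
    matcher = SequenceMatcher(None, word2, word1, autojunk=False)
    block = matcher.find_longest_match(0, len(word2), 0, len(word1))
    return block.size / len(word1)


def is_close(word2, word1, threshold=0.8):
    """Assumed reading of the 80% rule: compare the fraction with the threshold."""
    return sequential_fraction(word2, word1) >= threshold


print(is_close("hurts!!", "hurts!!!"))  # True  (7 of 8 characters)
print(is_close("ly", "absolutely"))     # False (2 of 10 characters)
```

Under this reading, amazi covers only 5 of the 7 characters of amazing (about 0.71), so a strict 0.8 threshold would reject it; the threshold would probably need to be lower for the examples in the question.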
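
In this case you can use the Jaccard coefficient. First, split both the first and the second string on spaces. Then, for each word in str2, compute the Jaccard coefficient against every word in str1 and replace the word with the one that has the highest coefficient.
You can use sklearn.metrics.jaccard_score, which expects the words encoded as label arrays rather than raw strings.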
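
The steps above can be sketched in plain Python. This is a minimal sketch that assumes the coefficient is taken over the set of characters in each word (the answer does not say which features to use) and that computes it by hand instead of through sklearn; jaccard and replace_with_closest are illustrative names.

```python
def jaccard(a, b):
    """Jaccard coefficient of the character sets of two words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def replace_with_closest(str1, str2):
    """Replace each word of str2 with its highest-scoring word from str1."""
    candidates = str1.split()
    if not candidates:
        return str2
    replaced = [
        max(candidates, key=lambda cand: jaccard(word, cand))
        for word in str2.split()
    ]
    return " ".join(replaced)


print(replace_with_closest("wow...it looks amazing", "looks amazi"))  # looks amazing
```

Character sets ignore order and repetition, and this sketch has no minimum score, so every word in str2 is replaced by something, even a poor match.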

Like this:
str1 = "wow...it looks amazing"
str2 = "looks amazi"
str3 = []
# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():
        if m in n:
            str3.append(n)
# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))
elif len(str3) == 1:
    print(str3[0])
Output:
looks amazing
Update based on the conditions given by the OP:
str1 = "good..."
str2 = "god.."
str3 = []
# Checking for similar strings in both strings:
for n in str1.split():
    for m in str2.split():
        # Collecting the characters of m that also occur in n:
        c = ''
        for i in m:
            if i in n:
                c += i
        # If the number of matching characters is at least 50% of the length of the word from str1,
        # or the word from str2 is contained in the word from str1:
        if len(c) >= len(n)*0.50 or m in n:
            str3.append(n)
# If found 2 similar strings:
if len(str3) == 2:
    # If their indexes align:
    if str1.split().index(str3[1]) - str1.split().index(str3[0]) == 1:
        print(' '.join(str3))
elif len(str3) == 1:
    print(str3[0])
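
There are many ways to approach this. This one solves all of your examples. I added a minimum similarity filter so that only higher-quality matches are returned. That is what allows 'ly' to be dropped in the last example, since it is not close to any of the words.
You can install Levenshtein with pip install python-Levenshtein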
import Levenshtein

def find_match(str1, str2):
    min_similarity = .75
    output = []
    # One row per word in str2, holding its score against every word in str1
    results = [[Levenshtein.jaro_winkler(x, y) for x in str1.split()] for y in str2.split()]
    for x in results:
        if max(x) >= min_similarity:
            output.append(str1.split()[x.index(max(x))])
    return output
Here are the results for the samples you posted:
find_match("is looking good", "looks goo")
['looking','good']
find_match("you are really looking good", "lok goo")
['looking','good']
find_match("Stu is actually SEVERLY sunburnt....it hurts!!!", "hurts!!")
['hurts!!!']
find_match("you guys were absolutely amazing tonight, a...", "ly amazin")
['amazing']
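
If installing python-Levenshtein is not an option, the same "best match above a minimum similarity" idea can be sketched with the standard library. Note that this swaps Jaro-Winkler for the ratio used by difflib, so scores and borderline cases will differ from find_match above; find_match_difflib is an illustrative name, not part of the original answer.

```python
from difflib import get_close_matches


def find_match_difflib(str1, str2, min_similarity=0.75):
    """For each word in str2, keep the closest word in str1 if it is similar enough."""
    candidates = str1.split()
    output = []
    for word in str2.split():
        # n=1 asks for the single best candidate; cutoff drops weak matches
        best = get_close_matches(word, candidates, n=1, cutoff=min_similarity)
        if best:
            output.append(best[0])
    return output


print(find_match_difflib("you guys were absolutely amazing tonight, a...", "ly amazin"))
# ['amazing']
```

Here 'ly' is dropped because none of the candidate words reaches the 0.75 cutoff against it.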

I got it working with regular expressions:
import re

def check_regex(str1, str2):
    # New list to store the updated value
    str_new = []
    for i in str2:
        # Regular expressions for comparing the strings
        x = ['['+i+']', '^'+i, i+'$', '('+i+')']
        for k in x:
            h = 0
            for j in str1:
                found = "".join(re.findall(k, j))
                # Conditions to make sure the word is close enough to the particular word
                if found == i or (found in i and abs(len(found)-len(i)) == 1 and len(i) != 2):
                    str_new.append(j)
                    h = 1
                    break
            if h == 1:
                break
    return str_new

str1 = input().split()
str2 = input().split()
print(" ".join(check_regex(str1, str2)))