查找近似字符串匹配并替换为预定义字符串的有效方法
Efficient way to find an approximate string match and replacing with predefined string
我需要建立一个NER
系统(Named Entity Recognition). For simplicity, I am doing it by using approximate string matching as input can contain typos and other minor modifications. I have come across some great libraries like: fuzzywuzzy or even faster RapidFuzz。但不幸的是我没有找到return匹配发生的位置的方法。作为我的目的我不仅需要找到匹配项,还需要知道匹配项发生在哪里。至于NER
,我需要用一些预定义的字符串替换那些匹配项。
例如,如果在输入字符串中找到任何一行,我想用字符串 COMPANY_NAME
:
替换它们
google
microsoft
facebook
International Business Machine
点赞,输入:S/he works at Google
会转化为S/he works at COMPANY_NAME
。
您可以放心地假设,所有输入和要匹配的模式都已经过预处理,最重要的是它们现在都是小写的。所以,区分大小写是没有问题的。
目前,我已经采用滑动 window 技术。一个滑动 window 从左到右传递到输入字符串上,这个 window 正好有我们想要匹配的模式的大小。例如,当我想与 International Business Machine
匹配时,我 运行 从左到右滑动 window 大小 3
并尝试通过观察每个 3
同时以 1
步长连续标记。我相信,这不是最好的方法,它也找不到 最佳 匹配项。
那么,找到最佳可能匹配的有效方法是什么,以及找到的匹配的量化(它们的相似程度)和匹配的位置( es),这样我们就可以用给定的固定字符串替换它们(如果计算出的相似度不小于阈值)?显然,单个输入可能包含多个要替换的部分,每个部分将被单独替换,如:Google and Microsoft are big companies
将变为 COMPANY_NAME and COMPANY_NAME are big companies
等
模块 fuzzywuzzy
和 RapidFuzz
似乎没有这方面的功能。您可以尝试使用 process.extract()
或 process.extractOne()
但它需要将文本拆分为更小的部分(即单词)并分别检查每个部分。对于像 International Business Machine
这样较长的单词,它需要分成 3 个单词的一部分 - 所以它需要更多的工作。
我认为你需要的是模块 fuzzysearch
import fuzzysearch
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
print(' text:', text)
print('---')
for word in sorted(words, key=len, reverse=True):
print(' word:', word)
results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
print('found:', results)
for item in reversed(results):
text = text[:item.start] + 'COMPANY' + text[item.end:]
print(' text:', text)
print('---')
结果:
text: Google and Microsoft are big companies like facebook International Business Machine
---
word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
text: Google and Microsoft are big companies like facebook COMPANY
---
word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
text: Google and COMPANY are big companies like facebook COMPANY
---
word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
text: Google and COMPANY are big companies like COMPANY COMPANY
---
word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
text: COMPANY and COMPANY are big companies like COMPANY COMPANY
如果它为一个词找到很多结果,那么最好从最后一个位置开始替换,以将其他词保留在同一位置。这就是为什么我使用 reversed()
.
我也会从最长的 word/name 开始,所以稍后它仍然可以搜索较短的单词,例如 Business
。这就是我使用 sorted(..., key=len, reverse=True)
的原因
但我不确定它是否完全符合您的要求。可能词多不对就会出问题
编辑:
我尝试为此使用 fuzzywuzzy
并创建了这个版本,但仅适用于单个单词的名称。对于International Business Machine
,它需要一些其他的想法。
它将全文拆分成单词并比较单词。稍后替换有口粮的单词 > 80
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process
new_words = []
for part in text.split():
matches = []
for word in words:
result = fuzz.token_sort_ratio(part, word)
matches.append([result, part, word])
#print([result, part, word])
matches = sorted(matches, reverse=True)
if matches and matches[0][0] > 80:
new_words.append('COMPANY')
else:
new_words.append(matches[0][1])
print(" ".join(new_words))
结果:
[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine
编辑:
第二个版本也检查名字有很多词
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
for name in all_names:
length = len(name.split(' ')) # how many words has name
print('name length:', length, '|', name)
words = text.split() # split text into words
# compare name with all words in text
matches = []
for index in range(0, len(words)-length+1):
# join words if name has more then 1 word
part = " ".join(words[index:index+length])
#print('part:', part)
result = fuzz.token_sort_ratio(part, name)
matches.append([result, name, part, [index, index+length]])
print([result, name, part, [index, index+length]])
# reverse to start at last position
matches = list(reversed(matches))
max_match = max(x[0] for x in matches)
print('max match:', max_match)
# replace
if max_match > 80:
for match in matches:
if match[0] == max_match:
idx = match[3]
words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]
text = " ".join(words)
print('text:', text)
print('---')
结果:
ame length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY
编辑:
版本 fuzzywuzzy.process
这次我没有位置,我只是使用标准text.replace(item[0], 'COMPANY')
。
我认为在大多数情况下它都能正常工作,不需要更好的方法。
这次是在文字上查错的:
'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like Facebook and International Business Machine'
# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
# ---
import fuzzywuzzy.process
#import fuzzywuzzy.fuzz
for name in sorted(all_names, key=len, reverse=True):
lenght = len(name.split())
words = text.split()
words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
#print(words)
#result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
print(name, result)
for item in result:
text = text.replace(item[0], 'COMPANY')
print(text)
我需要建立一个NER
系统(Named Entity Recognition). For simplicity, I am doing it by using approximate string matching as input can contain typos and other minor modifications. I have come across some great libraries like: fuzzywuzzy or even faster RapidFuzz。但不幸的是我没有找到return匹配发生的位置的方法。作为我的目的我不仅需要找到匹配项,还需要知道匹配项发生在哪里。至于NER
,我需要用一些预定义的字符串替换那些匹配项。
例如,如果在输入字符串中找到任何一行,我想用字符串 COMPANY_NAME
:
google
microsoft
facebook
International Business Machine
点赞,输入:S/he works at Google
会转化为S/he works at COMPANY_NAME
。
您可以放心地假设,所有输入和要匹配的模式都已经过预处理,最重要的是它们现在都是小写的。所以,区分大小写是没有问题的。
目前,我已经采用滑动 window 技术。一个滑动 window 从左到右传递到输入字符串上,这个 window 正好有我们想要匹配的模式的大小。例如,当我想与 International Business Machine
匹配时,我 运行 从左到右滑动 window 大小 3
并尝试通过观察每个 3
同时以 1
步长连续标记。我相信,这不是最好的方法,它也找不到 最佳 匹配项。
那么,找到最佳可能匹配的有效方法是什么,以及找到的匹配的量化(它们的相似程度)和匹配的位置( es),这样我们就可以用给定的固定字符串替换它们(如果计算出的相似度不小于阈值)?显然,单个输入可能包含多个要替换的部分,每个部分将被单独替换,如:Google and Microsoft are big companies
将变为 COMPANY_NAME and COMPANY_NAME are big companies
等
模块 fuzzywuzzy
和 RapidFuzz
似乎没有这方面的功能。您可以尝试使用 process.extract()
或 process.extractOne()
但它需要将文本拆分为更小的部分(即单词)并分别检查每个部分。对于像 International Business Machine
这样较长的单词,它需要分成 3 个单词的一部分 - 所以它需要更多的工作。
我认为你需要的是模块 fuzzysearch
import fuzzysearch
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
print(' text:', text)
print('---')
for word in sorted(words, key=len, reverse=True):
print(' word:', word)
results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
print('found:', results)
for item in reversed(results):
text = text[:item.start] + 'COMPANY' + text[item.end:]
print(' text:', text)
print('---')
结果:
text: Google and Microsoft are big companies like facebook International Business Machine
---
word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
text: Google and Microsoft are big companies like facebook COMPANY
---
word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
text: Google and COMPANY are big companies like facebook COMPANY
---
word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
text: Google and COMPANY are big companies like COMPANY COMPANY
---
word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
text: COMPANY and COMPANY are big companies like COMPANY COMPANY
如果它为一个词找到很多结果,那么最好从最后一个位置开始替换,以将其他词保留在同一位置。这就是为什么我使用 reversed()
.
我也会从最长的 word/name 开始,所以稍后它仍然可以搜索较短的单词,例如 Business
。这就是我使用 sorted(..., key=len, reverse=True)
但我不确定它是否完全符合您的要求。可能词多不对就会出问题
编辑:
我尝试为此使用 fuzzywuzzy
并创建了这个版本,但仅适用于单个单词的名称。对于International Business Machine
,它需要一些其他的想法。
它将全文拆分成单词并比较单词。稍后替换有口粮的单词 > 80
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process
new_words = []
for part in text.split():
matches = []
for word in words:
result = fuzz.token_sort_ratio(part, word)
matches.append([result, part, word])
#print([result, part, word])
matches = sorted(matches, reverse=True)
if matches and matches[0][0] > 80:
new_words.append('COMPANY')
else:
new_words.append(matches[0][1])
print(" ".join(new_words))
结果:
[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine
编辑:
第二个版本也检查名字有很多词
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
for name in all_names:
length = len(name.split(' ')) # how many words has name
print('name length:', length, '|', name)
words = text.split() # split text into words
# compare name with all words in text
matches = []
for index in range(0, len(words)-length+1):
# join words if name has more then 1 word
part = " ".join(words[index:index+length])
#print('part:', part)
result = fuzz.token_sort_ratio(part, name)
matches.append([result, name, part, [index, index+length]])
print([result, name, part, [index, index+length]])
# reverse to start at last position
matches = list(reversed(matches))
max_match = max(x[0] for x in matches)
print('max match:', max_match)
# replace
if max_match > 80:
for match in matches:
if match[0] == max_match:
idx = match[3]
words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]
text = " ".join(words)
print('text:', text)
print('---')
结果:
ame length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY
编辑:
版本 fuzzywuzzy.process
这次我没有位置,我只是使用标准text.replace(item[0], 'COMPANY')
。
我认为在大多数情况下它都能正常工作,不需要更好的方法。
这次是在文字上查错的:
'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like Facebook and International Business Machine'
# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
# ---
import fuzzywuzzy.process
#import fuzzywuzzy.fuzz
for name in sorted(all_names, key=len, reverse=True):
lenght = len(name.split())
words = text.split()
words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
#print(words)
#result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
print(name, result)
for item in result:
text = text.replace(item[0], 'COMPANY')
print(text)