查找近似字符串匹配并替换为预定义字符串的有效方法

Efficient way to find an approximate string match and replacing with predefined string

我需要建立一个NER系统(Named Entity Recognition). For simplicity, I am doing it by using approximate string matching as input can contain typos and other minor modifications. I have come across some great libraries like: fuzzywuzzy or even faster RapidFuzz。但不幸的是我没有找到return匹配发生的位置的方法。作为我的目的我不仅需要找到匹配项,还需要知道匹配项发生在哪里。至于NER,我需要用一些预定义的字符串替换那些匹配项。

例如,如果在输入字符串中找到任何一行,我想用字符串 COMPANY_NAME:

替换它们
google
microsoft
facebook
International Business Machine

点赞,输入:S/he works at Google会转化为S/he works at COMPANY_NAME。 您可以放心地假设,所有输入和要匹配的模式都已经过预处理,最重要的是它们现在都是小写的。所以,区分大小写是没有问题的。

目前,我已经采用滑动 window 技术。一个滑动 window 从左到右传递到输入字符串上,这个 window 正好有我们想要匹配的模式的大小。例如,当我想与 International Business Machine 匹配时,我 运行 从左到右滑动 window 大小 3 并尝试通过观察每个 3 同时以 1 步长连续标记。我相信,这不是最好的方法,它也找不到 最佳 匹配项。

那么,找到最佳可能匹配的有效方法是什么,以及找到的匹配的量化(它们的相似程度)和匹配的位置( es),这样我们就可以用给定的固定字符串替换它们(如果计算出的相似度不小于阈值)?显然,单个输入可能包含多个要替换的部分,每个部分将被单独替换,如:Google and Microsoft are big companies 将变为 COMPANY_NAME and COMPANY_NAME are big companies

模块 fuzzywuzzyRapidFuzz 似乎没有这方面的功能。您可以尝试使用 process.extract()process.extractOne() 但它需要将文本拆分为更小的部分(即单词)并分别检查每个部分。对于像 International Business Machine 这样较长的单词,它需要分成 3 个单词的一部分 - 所以它需要更多的工作。


我认为你需要的是模块 fuzzysearch

import fuzzysearch

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

print(' text:', text)
print('---')
    
for word in sorted(words, key=len, reverse=True):
    print(' word:', word)
    
    results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
    print('found:', results)
    
    for item in reversed(results):
        text = text[:item.start] + 'COMPANY' + text[item.end:]
    print(' text:', text)
    
    print('---')

结果:

 text: Google and Microsoft are big companies like facebook International Business Machine
---
 word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
 text: Google and Microsoft are big companies like facebook COMPANY
---
 word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
 text: Google and COMPANY are big companies like facebook COMPANY
---
 word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
 text: Google and COMPANY are big companies like COMPANY COMPANY
---
 word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
 text: COMPANY and COMPANY are big companies like COMPANY COMPANY

如果它为一个词找到很多结果,那么最好从最后一个位置开始替换,以将其他词保留在同一位置。这就是为什么我使用 reversed().

我也会从最长的 word/name 开始,所以稍后它仍然可以搜索较短的单词,例如 Business。这就是我使用 sorted(..., key=len, reverse=True)

的原因

但我不确定它是否完全符合您的要求。可能词多不对就会出问题


编辑:

我尝试为此使用 fuzzywuzzy 并创建了这个版本,但仅适用于单个单词的名称。对于International Business Machine,它需要一些其他的想法。

它将全文拆分成单词并比较单词。稍后替换有口粮的单词 > 80

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process

new_words = []

for part in text.split():

    matches = []

    for word in words:
        result = fuzz.token_sort_ratio(part, word)
        matches.append([result, part, word])
        #print([result, part, word])

    matches = sorted(matches, reverse=True)

    if matches and matches[0][0] > 80:
        new_words.append('COMPANY')
    else:
        new_words.append(matches[0][1])
        
print(" ".join(new_words))

结果:

[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine

编辑:

第二个版本也检查名字有很多词

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz


for name in all_names:

    length = len(name.split(' ')) # how many words has name 
    print('name length:', length, '|', name)

    words = text.split()  # split text into words

    # compare name with all words in text
    
    matches = []
    
    for index in range(0, len(words)-length+1):
        # join words if name has more then 1 word
        part = " ".join(words[index:index+length])
        #print('part:', part)
        
        result = fuzz.token_sort_ratio(part, name)
        matches.append([result, name, part, [index, index+length]])

        print([result, name, part, [index, index+length]])
        
    # reverse to start at last position
    matches = list(reversed(matches))

    max_match = max(x[0] for x in matches)
    print('max match:', max_match)

    # replace
    if max_match > 80:
        for match in matches:
            if  match[0] == max_match:
                idx = match[3]  
                words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]

    text = " ".join(words)
    print('text:', text)
    print('---')

结果:

ame length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY

编辑:

版本 fuzzywuzzy.process

这次我没有位置,我只是使用标准text.replace(item[0], 'COMPANY')

我认为在大多数情况下它都能正常工作,不需要更好的方法。

这次是在文字上查错的:

'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like Facebook and International Business Machine'

# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

# ---

import fuzzywuzzy.process
#import fuzzywuzzy.fuzz

for name in sorted(all_names, key=len, reverse=True):
    lenght = len(name.split())

    words = text.split()
    words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
    #print(words)

    #result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
    result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
    print(name, result)

    for item in result:
        text = text.replace(item[0], 'COMPANY')

print(text)