How to get all fuzzy matching substrings between two strings in python?

Suppose I have three example strings:

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

If I were to get all the exact matching substrings between these texts, I would get:

text1_text2_common = [
    '3 days ago.',
]

text2_text3_common = [
    'of',
]

text1_text3_common = [
    'was',
    'idx',
    'every',
    'hours'
]

What I am looking for is fuzzy matching, using something like the Levenshtein distance. So even if the substrings are not exact, they would still be selected as matches as long as they are similar enough to meet the criterion.

So ideally I am looking for something like this:

text1_text3_common_fuzzy = [
    'prescription of idx, 20mg to be given every four hours'
]
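To be concrete, here is a rough sketch of the kind of similarity test I mean, using difflib.SequenceMatcher as a stand-in for a Levenshtein-style ratio (the spans and the 0.6 threshold are just examples):

from difflib import SequenceMatcher

# Example spans from text1 and text3 that I would want to be treated as a fuzzy match.
a = "prescribed idx 20 mg every 4 hours"
b = "prescription of idx, 20mg to be given every four hours"

ratio = SequenceMatcher(None, a, b).ratio()
print(round(ratio, 2), ratio >= 0.6)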

Perhaps you can take this as a starting point for a solution. You can split your texts into words, group them, and compare each group with a substring of the other text (of the same size), returning those that score above a certain ratio. I removed commas and periods since they are not relevant to me. You can use any other comparison tool instead of SequenceMatcher (in some cases you do not have to split both sides into substrings of equal size; you could look at fuzzywuzzy).

You will have to play with the ratio to get the results you want. You will also have to go through the results and remove those that are merely substrings of other results (a sketch of that filtering step follows the output below). That depends on your needs:

from difflib import SequenceMatcher

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def fuzzy_match(t1: str, t2: str, min_len: int = 3, max_len: int = 10, ratio_to_get: float = 0.6):
    # Return word-window substrings of t1 (min_len..max_len-1 words) that fuzzily match some window of t2.
    t1 = t1.replace(".", "").replace(",", "")
    t2 = t2.replace(".", "").replace(",", "")

    result = set()
    t2_splitted = t2.split(" ")
    t1_splitted = t1.split(" ")
    for count in range(min_len, max_len, 1):
        for pos_2 in range(len(t2_splitted) - count + 1):
            substr_2 = " ".join(t2_splitted[pos_2: pos_2 + count])
            for pos_1 in range(len(t1_splitted) - count + 1):
                substr_1 = " ".join(t1_splitted[pos_1: pos_1 + count])
                ratio = SequenceMatcher(None, substr_1, substr_2).ratio()
                if ratio >= ratio_to_get:
                    result.add(substr_1)

    return result


if __name__ == '__main__':
    print(fuzzy_match(text1, text2))
    print(fuzzy_match(text2, text1))
    print(fuzzy_match(text1, text3))
    print(fuzzy_match(text3, text1))
    print(fuzzy_match(text2, text3))
    print(fuzzy_match(text3, text2))

Output:

{'days ago Patient', '3 days ago', 'started 3 days ago', 'which started 3 days ago', '3 days ago Patient', 'started 3 days'}
{'was 3 days', '3 days ago', 'discomfort was 3 days ago', 'was 3 days ago'}
{'prescribed idx 20 mg', 'mg every 4 hours', 'idx 20 mg every 4', '20 mg every 4 hours', 'Patient was prescribed idx 20 mg every 4', 'ago Patient was prescribed idx 20 mg every 4', 'ago Patient was prescribed idx 20 mg', 'was prescribed idx 20', 'ago Patient was prescribed idx 20', 'ago Patient was prescribed idx 20 mg every', 'prescribed idx 20', 'was prescribed idx 20 mg every', 'Patient was prescribed idx 20 mg', 'ago Patient was prescribed idx', 'was prescribed idx 20 mg', 'Patient was prescribed', 'prescribed idx 20 mg every', 'idx 20 mg', 'Patient was prescribed idx 20 mg every', 'was prescribed idx 20 mg every 4', 'idx 20 mg every', 'mg every 4', '20 mg every', 'Patient was prescribed idx 20', 'prescribed idx 20 mg every 4', 'every 4 hours'}
{'of idx 20mg to', 'given a prescription of idx 20mg to be', 'idx 20mg to be given', 'given a prescription of idx', 'be given every', 'idx 20mg to', 'every four hours', 'given every four', 'given a prescription of idx 20mg to', 'prescription of idx 20mg to be', 'a prescription of idx 20mg', 'prescription of idx 20mg to', 'prescription of idx 20mg', 'given a prescription of idx 20mg to be given', 'a prescription of idx 20mg to be', 'idx 20mg to be', 'given a prescription', 'to be given every four hours', 'be given every four', 'given every four hours', 'a prescription of idx 20mg to', 'be given every four hours', 'given a prescription of idx 20mg', 'a prescription of idx', 'prescription of idx', 'of idx 20mg'}
set()
set()
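As mentioned above, you may also want to drop results that are contained in other, longer results. A minimal sketch of that filtering step (the helper name is my own) could be:

def drop_contained(matches: set) -> set:
    # Keep only matches that are not substrings of another, longer match.
    return {m for m in matches
            if not any(m != other and m in other for other in matches)}


# Continuing from the example above:
print(drop_contained(fuzzy_match(text1, text2)))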

Here is a piece of code that computes the similarity between a sub-string of string1 and the full string of string2 via a fuzzy ratio. The code can also handle a sub-string of string2 against the full string of string1, as well as a sub-string of string1 against a sub-string of string2.

This one uses nltk to generate the ngrams.

Typical algorithm:

  1. Generate ngrams from the first given string.
    Example:
    text2 = "The time of discomfort was 3 days ago."
    total_length = 8

In the code, the parameter takes the values 5, 6, 7 and 8.
param = 5
ngrams_5 = ['The time of discomfort was', 'time of discomfort was 3', 'of discomfort was 3 days', 'discomfort was 3 days ago.']

  2. Compare each of them with the second string.
    Example:
    text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."

@param=5

  • Compare 'The time of discomfort was' with text1 and get the fuzzy score
  • Compare 'time of discomfort was 3' with text1 and get the fuzzy score
  • And so on, until all elements of ngrams_5 are done
    If the fuzzy score is greater than or equal to the given threshold, save the sub-string.

@param=6

  • Compare 'The time of discomfort was 3' with text1 and get the fuzzy score
  • And so on

And so on, up to @param=8.
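For reference, a minimal sketch of the ngram-generation step described above, using the same nltk.util.ngrams helper as in the full code below:

from nltk.util import ngrams

text2 = "The time of discomfort was 3 days ago."

# param = 5: sliding windows of 5 words over the 8 words of text2.
ngrams_5 = [' '.join(g) for g in ngrams(text2.split(), 5)]
print(ngrams_5)
# ['The time of discomfort was', 'time of discomfort was 3',
#  'of discomfort was 3 days', 'discomfort was 3 days ago.']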

You can modify the code and change n_start to around 5 so that the ngrams of string1 are compared with the ngrams of string2; in that case it becomes a comparison of sub-strings of string1 against sub-strings of string2.

# Generate ngrams for string2
n_start = 5  # st2_length
for n in range(n_start, st2_length + 1):
    ...

For the comparison I use:

fratio = fuzz.token_set_ratio(fs1, fs2)

You can also try out the other ratios.
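As an illustration (my addition, not part of the measured output below), the other scorers in fuzzywuzzy can be tried on the same pair of strings:

from fuzzywuzzy import fuzz

fs1 = "prescription of idx, 20mg to be given every four hours"  # sub-string of text3
fs2 = "Patient was prescribed idx 20 mg every 4 hours."          # part of text1

# The scorers weigh word order and partial overlap differently.
print(fuzz.ratio(fs1, fs2))
print(fuzz.partial_ratio(fs1, fs2))
print(fuzz.token_sort_ratio(fs1, fs2))
print(fuzz.token_set_ratio(fs1, fs2))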

Your sample 'prescription of idx, 20mg to be given every four hours' gets a fuzzy score of 52.

See the sample console output below.

7                    prescription of idx, 20mg to be given every four hours           52

Code

"""
fuzzy_match.py



Dependent modules:
    pip install pandas
    pip install nltk
    pip install fuzzywuzzy
    pip install python-Levenshtein

"""


from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz


# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"


def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare with st2.
    The sub-strings, full string and fuzzy ratio will be saved in csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)
            
            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save sub string if ratio is within threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: is a string where we extract the sub-string.
    full: is a string as the base/reference.
    threshold: is the minimum fuzzy ratio where we will save the sub string. Max fuzz ratio is 100.
    """   
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. High ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combo.


if __name__ == '__main__':
    main()

Console output

                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
0                                                 be given every four hours           61
1                                    idx, 20mg to be given every four hours           58
2        was given a prescription of idx, 20mg to be given every four hours           56
3                                              to be given every four hours           56
4   John was given a prescription of idx, 20mg to be given every four hours           56
5                                 of idx, 20mg to be given every four hours           55
6              was given a prescription of idx, 20mg to be given every four           52
7                    prescription of idx, 20mg to be given every four hours           52
8            given a prescription of idx, 20mg to be given every four hours           52
9                  a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours

Sample CSV output

string2_string1.csv

Using Spacy similarity

Here is an approach using spacy.

These are the results of comparing the sub-strings of text3 against the full text of text1.

The results below are meant to be compared with the 2nd table above, to see which method gives the better similarity ranking.

I used the large model (en_core_web_lg) to get the results below.

Code

import spacy
import pandas as pd


nlp = spacy.load("en_core_web_lg")

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

text3_sub = [
    'be given every four hours', 'idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four hours',
    'to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four hours',
    'of idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four',
    'prescription of idx, 20mg to be given every four hours',
    'given a prescription of idx, 20mg to be given every four hours',
    'a prescription of idx, 20mg to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four',
    'idx, 20mg to be given every',
    '20mg to be given every four hours'
]


# Process the full reference text once; only the sub-strings change per iteration.
doc2 = nlp(text1)

data = []
for s in text3_sub:
    doc1 = nlp(s)
    sim = round(doc1.similarity(doc2), 3)
    data.append([s, text1, sim])

df = pd.DataFrame(data)
df.columns = ['from text3', 'text1', 'similarity']
df = df.sort_values(by=['similarity'], ascending=[False])
df = df.reset_index(drop=True)

df1 = df[['from text3', 'similarity']]
print(df1.to_string())

print()
print(f'text3: {text3}')
print(f'text1: {text1}')

Output

                                                                 from text3  similarity
0        was given a prescription of idx, 20mg to be given every four hours       0.904
1   John was given a prescription of idx, 20mg to be given every four hours       0.902
2                  a prescription of idx, 20mg to be given every four hours       0.895
3                    prescription of idx, 20mg to be given every four hours       0.893
4            given a prescription of idx, 20mg to be given every four hours       0.892
5                                 of idx, 20mg to be given every four hours       0.889
6                                    idx, 20mg to be given every four hours       0.883
7              was given a prescription of idx, 20mg to be given every four       0.879
8         John was given a prescription of idx, 20mg to be given every four       0.877
9                                         20mg to be given every four hours       0.877
10                                              idx, 20mg to be given every       0.835
11                                             to be given every four hours       0.834
12                                                be given every four hours       0.832

text3: John was given a prescription of idx, 20mg to be given every four hours
text1: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

It looks like the spacy approach produces a good similarity ranking.

The highest-performing approach for this is to build a finite state transducer (FST) and use it for fuzzy matching.

But there are other, simpler approaches as well, such as a prefix trie.
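As a rough sketch of the trie idea (my own illustration, not taken from the links below), a vocabulary can be loaded into a prefix trie and searched with a bounded edit distance, filling in one row of the Levenshtein table per trie node:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a word


def insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word


def fuzzy_lookup(root: TrieNode, query: str, max_cost: int):
    """Return (word, distance) pairs within max_cost edits of query."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from the empty prefix
    for ch, child in root.children.items():
        _search(child, ch, query, first_row, max_cost, results)
    return results


def _search(node, ch, query, prev_row, max_cost, results):
    # Build the next row of the edit-distance table for this trie edge.
    row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        row.append(min(row[col - 1] + 1,                            # insertion
                       prev_row[col] + 1,                           # deletion
                       prev_row[col - 1] + (query[col - 1] != ch))) # substitution
    if node.word is not None and row[-1] <= max_cost:
        results.append((node.word, row[-1]))
    if min(row) <= max_cost:  # prune branches that can no longer match
        for next_ch, child in node.children.items():
            _search(child, next_ch, query, row, max_cost, results)


root = TrieNode()
for w in "prescribed prescription every four hours idx".split():
    insert(root, w)

print(fuzzy_lookup(root, "prescrption", max_cost=2))  # the typo still matches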

Check out the links below to read more on this topic from people far more knowledgeable than I am.

Peter Norvig's How to Write a Spelling Corrector contains an example implementation of this, of the kind Google's search bar uses to auto-correct misspellings, and links to versions in other languages.

Kay Schlühr's articles on fuzzy string matching cover this topic in far more depth and quality than I can summarize here.

If you are more interested in how Lucene, the library behind ElasticSearch, does this, Michael McCandless wrote a blog post on FSTs and an academic paper on the subject.

If you are looking for the (current) cutting edge, take a look at pisa, which implements state-of-the-art information retrieval algorithms.

Short answer:

import string


def match(a, b):
    # Rough character-frequency similarity in [0, 1] based on letter counts.
    a, b = a.lower(), b.lower()
    error = 0
    for ch in string.ascii_lowercase:
        error += abs(a.count(ch) - b.count(ch))
    total = len(a) + len(b)
    return (total - error) / total


def what_you_need(s1, s2, fuzziness=0.8):
    # If the two strings are not similar enough overall, return the
    # characters of s1 that also occur somewhere in s2.
    if match(s1, s2) < fuzziness:
        return "".join(ch for ch in s1 if ch in s2)