Python: Finding and counting exact and approximate matches of words in txt file

My program is close to doing what I want, but I have a problem: many of the keywords I am trying to find may have symbols in the middle or be misspelled. I would therefore like to count misspelled words as keyword matches, as if they were spelled correctly. For example, suppose my text is: "settlement settl#7*nt se##tl#ment ann&&ity annuity".

I want to count how many times a .txt file contains the keywords "settlement" and "annuity", and additionally count words that start with "sett" and end with "nt" as "settlement", and words that start with "ann" and end with "y" as "annuity".

I have already been able to count the exact words, and I am very close to what I want the program to do. But now I want to add the approximate matching, and I am not even sure it is possible. Thanks.

import glob
import os
import sys

out1 = open("seen.txt", "w")     # keywords that were found
out2 = open("missing.txt", "w")  # keywords that were not found

def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key in words:
                # count exact (substring) occurrences of each keyword
                words[key] = data.count(key)
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

dirpath = sys.argv[1]
keys = ["annuity", "settlement"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(dirpath, words, action=print_summary)

out1.close()
out2.close()
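
Regarding the prefix/suffix rule from the question (words starting with "sett" and ending with "nt" count as "settlement"; words starting with "ann" and ending with "y" count as "annuity"): that part can be handled with the standard re module alone. Below is a minimal, hypothetical sketch of the idea; the patterns are illustrative and not part of the program above:

import re

text = "settlement settl#7*nt se##tl#ment ann&&ity annuity"

# \S* allows symbols such as # and * inside a word.
patterns = {
    "settlement": r"\bsett\S*nt\b",  # starts with "sett", ends with "nt"
    "annuity": r"\bann\S*y\b",       # starts with "ann", ends with "y"
}
counts = {key: len(re.findall(pat, text)) for key, pat in patterns.items()}
print(counts)  # {'settlement': 2, 'annuity': 2} (se##tl#ment lacks the "sett" prefix)

Note that this rule alone misses se##tl#ment, which is why the fuzzy matching described below is the more general approach.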

Fuzzy matching can be done with the regex module, installed once with the pip install regex command.

With this regex module you can take any expression and, using the {e<=2} suffix, specify how many errors are allowed for a word to still match the expression (an error is a substitution, an insertion, or a deletion of one character). This is also known as the edit distance, or Levenshtein distance.
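
For instance, here is a minimal sketch of that syntax, using the corrupted words from the question:

import regex

# One substitution ("e" -> "#"), so within the {e<=2} budget:
print(bool(regex.fullmatch(r'(settlement){e<=2}', 'settl#ment')))  # True

# Three substitutions ("e" -> "#", "m" -> "7", "e" -> "*"), over the budget:
print(bool(regex.fullmatch(r'(settlement){e<=2}', 'settl#7*nt')))  # False

# The same word matches once the budget is raised to three errors:
print(bool(regex.fullmatch(r'(settlement){e<=3}', 'settl#7*nt')))  # True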

As an example, I wrote my own function that counts words in a given string. It has a num_errors parameter that specifies how many errors are acceptable for a given word to count as a match. I set num_errors = 3, and you can allow a higher error rate, but do not set it very high, otherwise any word in the text will match any reference word.

To split the text into words I used re.split().
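
As a quick illustration of what that split produces (note the trailing empty string when the text ends with a delimiter; it matches no pattern, so it is harmless):

import re

print(re.split(r'[,.\s]+', 'settlement settl#7*nt, annuity.'))
# ['settlement', 'settl#7*nt', 'annuity', '']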


import regex as re

def count_words(text, words, *, num_errors=3):
    # Build one fuzzy pattern per reference word, e.g. "(settlement){e<=3}",
    # allowing up to num_errors substitutions, insertions, or deletions.
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e: 0 for e in words}
    for wt in re.split(r'[,.\s]+', text):  # split the text into words
        for wre, wrt in zip(we, words):
            # count the word under the first reference word it matches
            if re.fullmatch(wre, wt):
                cnt[wrt] += 1
                break
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

As a faster alternative to the regex module, you can use the Levenshtein module, installed once with the pip install python-Levenshtein command.

This module implements only the edit distance (mentioned above) and should run faster than the regex module.
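
To see the distances behind the example, here is a minimal check of Levenshtein.distance on the corrupted words from the question:

import Levenshtein

print(Levenshtein.distance('settlement', 'settl#ment'))   # 1: one substitution
print(Levenshtein.distance('settlement', 'settl#7*nt'))   # 3: three substitutions
print(Levenshtein.distance('settlement', 'se##tl#ment'))  # 3: one insertion, two substitutions
print(Levenshtein.distance('annuity', 'ann&&ity'))        # 2: one insertion, one substitution

All of these fall within num_errors = 3, which is why every corrupted form is counted in the example output.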

The same code as above, but implemented with the Levenshtein module:


import re
import Levenshtein

def count_words(text, words, *, num_errors=3):
    cnt = {e: 0 for e in words}
    for wt in re.split(r'[,.\s]+', text):  # split the text into words
        for wr in words:
            # count the word under the first reference word within the error budget
            if Levenshtein.distance(wr, wt) <= num_errors:
                cnt[wr] += 1
                break
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}

As the OP requested, here is a third algorithm that does not use re.split() to break the text into words, but instead searches the raw text with re.finditer().


import regex as re

def count_words(text, words, *, num_errors=3):
    # Same fuzzy patterns as before, e.g. "(settlement){e<=3}"
    we = ['(' + re.escape(e) + f'){{e<={num_errors}}}' for e in words]
    cnt = {e: 0 for e in words}
    for wre, wrt in zip(we, words):
        # count all non-overlapping fuzzy matches of the pattern in the raw text
        cnt[wrt] += len(list(re.finditer(wre, text)))
    return cnt

text = 'settlement settl#7*nt se##tl#ment ann&&ity annuity hello world.'
print(count_words(text, ['settlement', 'annuity']))

Output:

{'settlement': 3, 'annuity': 2}