Finding similar strings with restricted alpha characters using Python

Question

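I want to group similar strings, but I would like to be smart about it and catch cases where strings differ only in convention characters such as '/' or '-', rather than in their letters.

Given the following input: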

moose
mouse
mo/os/e
m.ouse

alpha = ['/','.']

I want to group the strings according to the restricted alphabet, so the output should be:

moose
mo/os/e

mouse
m.ouse

I know I can get similar strings using difflib, but it does not provide an option to restrict the alphabet. Is there another way to do this? Thank you.
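To make the limitation concrete, one possible workaround is to strip the convention characters before handing the strings to difflib. This is only a minimal sketch: strip_chars and similarity are hypothetical helpers introduced here, not part of difflib.

```python
from difflib import SequenceMatcher

alpha = ['/', '.']

def strip_chars(word, chars=alpha):
    """Hypothetical helper: drop every character listed in chars."""
    return ''.join(c for c in word if c not in chars)

def similarity(a, b):
    """Similarity ratio of two strings as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()

# Raw comparison: '/' counts as a difference just like a letter would.
raw = similarity('moose', 'mo/os/e')
# Comparison after stripping: the two strings become identical.
stripped = similarity(strip_chars('moose'), strip_chars('mo/os/e'))
print(raw, stripped)
```

After stripping, strings that differ only in the alpha characters compare as identical, while the raw comparison still penalizes them.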

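Update:

Unlike a restricted alphabet, alpha is easier to implement: checking the number of occurrences is enough. That is why I changed the title.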

Answer 1

Maybe something like this:

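from collections import defaultdict

# Input from the question
words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = ['/', '.']

# Key: the word with the alpha characters removed; value: the original words
container = defaultdict(list)
for word in words:
    container[''.join(item for item in word if item not in alpha)].append(word)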
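For illustration, the same idea can be wrapped in a function and run against the input from the question. This is only a sketch; the name group_by_stripped is made up here and does not come from the answer.

```python
from collections import defaultdict

def group_by_stripped(words, alpha):
    """Group words that are equal once the characters in alpha are removed."""
    groups = defaultdict(list)
    for word in words:
        key = ''.join(ch for ch in word if ch not in alpha)
        groups[key].append(word)
    return dict(groups)

groups = group_by_stripped(['moose', 'mouse', 'mo/os/e', 'm.ouse'], ['/', '.'])

# Print each group on consecutive lines, separated by a blank line.
for members in groups.values():
    print('\n'.join(members))
    print()
```

Dictionaries keep insertion order in Python 3.7+, so the groups come out in the order in which their first member was seen.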

Answer 2

Here is an idea that takes just a few (simple) steps:

import re
example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']

1. Index all the strings so that they can be referred to by index later:

indexed_strings = list(enumerate(example_strings))

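2. Store every string that contains a restricted character in a dictionary, with the index as the key and the string as the value. Then temporarily strip the restricted characters for sorting: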

# regex to match restricted alphabet
restricted = re.compile(r'[/.]')
# dictionary to store strings with restricted char
restricted_dict = {}
for (idx, string) in indexed_strings:
    if restricted.search(string):
        # storing the string with a restricted char by its index
        restricted_dict[idx] = string
        # stripping the restricted char temporarily and returning to the list
        indexed_strings[idx] = (idx, restricted.sub('', string))

3. Sort the list of cleaned strings by string value, then go through the strings again and replace each stripped string with its original value:

indexed_strings.sort(key=lambda x: x[1])
# make a new list for the final set of strings
final_strings = []
for (idx, string) in indexed_strings:
    if idx in restricted_dict:
        final_strings.append(restricted_dict[idx])
    else:
        final_strings.append(string)

Result: ['ca...t', 'ca..//t', 'cat', 'm/oose', 'moose', 'mouse', 'm.ouse']
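Because Python's sort is stable, the three steps above can probably be collapsed into a single sorted call with a key function. A sketch under that assumption, reusing the names from this answer (strip_restricted is a name introduced here):

```python
import re

example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']

# Regex matching the restricted characters '/' and '.'
restricted = re.compile(r'[/.]')

def strip_restricted(string):
    """Return the string with all restricted characters removed."""
    return restricted.sub('', string)

# Sort by the stripped form; the original strings are kept as the values.
final_strings = sorted(example_strings, key=strip_restricted)
print(final_strings)
```

Strings that share a stripped form keep their original relative order, which is why this yields the same list as the step-by-step version.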

Answer 3

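Since you want to group words, you should probably use groupby.

You just need to define a function that removes the alpha characters (e.g. with str.translate), and then you can apply sort and groupby to your data: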

from itertools import groupby

words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = ['/','.']

alpha_table = str.maketrans('', '', ''.join(alpha))

def remove_alphas(word):
    return word.lower().translate(alpha_table)

words.sort(key=remove_alphas)
print(words)
# ['moose', 'mo/os/e', 'mouse', 'm.ouse'] # <- Words are sorted correctly.

for common_word, same_words in groupby(words, remove_alphas):
    print(common_word)
    print(list(same_words))
# moose
# ['moose', 'mo/os/e']
# mouse
# ['mouse', 'm.ouse']
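The sort step matters here: groupby only merges consecutive items that share a key. A small sketch comparing the grouping with and without sorting (group_words is a name introduced here for illustration):

```python
from itertools import groupby

# Translation table that deletes '/' and '.'
alpha_table = str.maketrans('', '', '/.')

def remove_alphas(word):
    return word.lower().translate(alpha_table)

def group_words(words):
    """Collect consecutive words with the same stripped form into lists."""
    return [list(group) for _, group in groupby(words, remove_alphas)]

words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
unsorted_groups = group_words(words)
sorted_groups = group_words(sorted(words, key=remove_alphas))
print(unsorted_groups)
print(sorted_groups)
```

On the unsorted input every word ends up in its own group, because equal keys are never adjacent; after sorting, the two expected groups appear.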
