使用 fuzzywuzzy 时如何更有效地比较字符串？

Question

我有一个包含约 20000 个单词的 CSV 文件，我想按相似性对这些单词进行分组。为了完成这样的任务，我使用了很棒的 fuzzywuzzy 包，它似乎工作得很好，并且用一个小数据集（~100 字）

实现了我正在寻找的东西

这些词实际上是品牌名称，这是我刚才提到的小型数据集的示例输出，其中我得到了按名称分组的相似品牌：

[
    ('asos-design', 'asos'), 
    ('m-and-s', 'm-and-s-collection'), 
    ('polo-ralph-lauren', 'ralph-lauren'), 
    ('hugo-boss', 'boss'), 
    ('yves-saint-laurent', 'saint-laurent')
]

现在，我的问题是，如果我运行我当前的完整数据集代码，它真的很慢，而且我真的不知道如何提高性能，或者如何在不使用 2 个 for 循环的情况下执行此操作。

这是我的代码。

import csv
from fuzzywuzzy import fuzz

THRESHOLD = 90

possible_matches = []


with open('words.csv', encoding='utf-8') as csvfile:
    words = []
    reader = csv.reader(csvfile)
    for row in reader:
        word, x, y, *rest = row
        words.append(word)

    for i in range(len(words)-1):
        for j in range(i+1, len(words)): 
            if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD:
                possible_matches.append((words[i], words[j]))

        print(i)
    print(possible_matches)

如何提高性能？

Answer 1

尝试使用列表理解，它比 list.append() 方法更快：

with open('words.csv', encoding='utf-8') as csvfile:
    words = [row[0] for row in csv.reader(csvfile)]

    possible_matches = [(words[i], words[j]) for i in range(len(words)-1) for j in range(i+1, len(words)) if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD]

    print(possible_matches)

不幸的是，用这种方式你不能在每次迭代中都执行 print(i)，但假设你只需要 print(i) 进行调试，它不会影响你的最终结果。

将循环转换为列表理解非常容易，假设您有这样一个循环：

for i in iterable_1:
    lst.append(something)

列表理解变为：

lst = [something for i in iterable_1]

对于嵌套循环和条件，只要遵循相同的逻辑即可：

iterable_1:
    iterable_2:
        ...
        some_condition:
            lst.append(something)

# becomes

lst = [something <iterable_1> <iterable_2> ... <some_condition>]

# Or if you have an else clause:

iterable_1:
    ...
    if some_condition:
        lst.append(something)
    else:
        lst.append(something_else)

lst = [something if some_condition else something_else <iterable_1> <iterable_2> ...]

Answer 2

对于 20,000 个单词或品牌，任何将每个单词与其他单词进行比较的方法（即具有二次复杂度 O(n²)）可能都太慢了。对于 20,000，它可能仍然勉强可以接受，但是对于任何更大的数据集，它很快就会崩溃。

相反，您可以尝试从您的单词中提取一些 "feature" 并相应地对它们进行分组。我的第一个想法是使用 stemmer，但由于您的单词是名称而不是 真实的 单词，因此这行不通。我不知道你的样本数据的代表性如何，但你可以尝试根据由 - 分隔的组件对单词进行分组，然后得到唯一的非平凡组，你就完成了。

words = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection', 
         'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
         'yves-saint-laurent', 'saint-laurent']

from collections import defaultdict
parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)

结果：

{('asos-design', 'asos'),
 ('hugo-boss', 'boss'),
 ('m-and-s', 'm-and-s-collection'),
 ('polo-ralph-lauren', 'ralph-lauren'),
 ('yves-saint-laurent', 'saint-laurent')}

您可能还想先过滤掉一些 stop words，例如 and，或者将它们与周围的词放在一起。这可能仍会产生一些误报，例如polo 或 collection 之类的词可能会出现在几个不同的品牌中，但我认为使用 fuzzywuzzy 或类似词也是如此。可能需要对组进行一些 post 处理和手动过滤。

使用 fuzzywuzzy 时如何更有效地比较字符串？

How to compare strings more efficiently when using fuzzywuzzy?

python

csv

algorithm

fuzzywuzzy