Check if two strings contain the same set of words in Python

Question

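I am trying to compare two sentences to see whether they contain the same set of words.
For example, comparing "today is a good day" and "is today a good day" should return true.
I am currently using the Counter class from the collections module: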

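from collections import Counter

# file_ob is the open text file being read
vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        # Both Counters are rebuilt for every (line, sentence) pair
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    # No stored sentence matched: register the line as a new entry
    if flag == 0:
        vocab[line] = 1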

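It seems to work fine for a few lines, but my text file has more than 1000 lines and the script never finishes. Is there a more efficient way to compute the result for the whole file?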

Edit:

I only need a replacement for the Counter approach, something to substitute for it. No other changes to the implementation.

Answer 1

Try:

set(sentence.split(" ")) == set(line.split(" "))

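Comparing set objects is faster than comparing Counter objects. Both are hash-based containers, but comparing two Counter objects has to check both keys and values, while comparing two sets only has to check keys.
Thanks to Eric and Barmar for their input.

Your complete code would look like this: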

# vocab is a dictionary with around 1000 sentences as keys,
# file_ob is the open text file from the question.
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence] += 1
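
One caveat about swapping Counter for set: the two comparisons are not equivalent when a word is repeated. The following is only a small sketch with made-up sample sentences (`a` and `b` are not from the question's file) to show the difference:

```python
from collections import Counter

a = "a b c"
b = "a a b c"  # same distinct words, but "a" appears twice

# Sets only record which words occur, so the two sentences compare equal.
same_as_sets = set(a.split()) == set(b.split())

# Counters also record how often each word occurs, so they differ.
same_as_counters = Counter(a.split()) == Counter(b.split())

print(same_as_sets, same_as_counters)  # True False
```

If repeated words should make two sentences different, the set comparison is probably not the right replacement.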

Answer 2

To take duplicate words into account, your equality comparison could be:

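def hash_sentence(s):
    # Sort the words so that their order does not matter. Join with a space
    # so that e.g. "ab c" and "a bc" do not collapse into the same string.
    return hash(' '.join(sorted(s.split())))

a = 'today is a good day'
b = 'is today a good day'
c = 'today is a good day is a good day'

hash_sentence(a) == hash_sentence(b)  # True
hash_sentence(a) == hash_sentence(c)  # False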

Also, note that in your implementation every sentence is counted n times (for sentence in vocab:).

Answer 3

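In your code, you can move the Counter construction out of the inner loop instead of recomputing both Counters for every pair. This should speed up the algorithm by a factor proportional to the average number of tokens per string.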

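from collections import Counter

# vocab is a dictionary with around 1000 sentences as keys,
# file_ob is the open text file from the question.

# Build the Counter for every stored sentence once, up front
vocab_counter = {k: Counter(k.split(" ")) for k in vocab}

for line in file_ob:
    # Build the Counter for the current line once, not once per comparison
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence] += 1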

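This can be improved further by using the counter as a dictionary key, so that matching a sentence becomes a lookup instead of a linear search. The frozendict package may be useful here, since it lets you use one dictionary as the key of another.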
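
As a rough sketch of that idea without a third-party package, a Counter can be turned into a hashable key with `frozenset(counter.items())` from the standard library. This swaps frozenset in for frozendict. The helper `counter_key`, the `first_seen` dictionary and the sample `lines` are illustrative assumptions, not part of the original code:

```python
from collections import Counter

def counter_key(sentence):
    """Hashable stand-in for Counter(sentence.split())."""
    return frozenset(Counter(sentence.split()).items())

# Hypothetical sample data; in the question these would be the file's lines.
lines = ["today is a good day", "is today a good day", "a b c", "a a b c"]

vocab = {}       # first sentence seen for each word multiset -> count
first_seen = {}  # hashable key -> first sentence that produced it
for line in lines:
    key = counter_key(line)
    # One dictionary lookup replaces the loop over all stored sentences
    sentence = first_seen.setdefault(key, line)
    vocab[sentence] = vocab.get(sentence, 0) + 1

print(vocab)
# {'today is a good day': 2, 'a b c': 1, 'a a b c': 1}
```

Each line now costs one Counter construction and one lookup, regardless of how many sentences are already stored.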

Answer 4

You really don't need two loops.

The correct way to use dicts

Suppose you have a dict:

my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}

Your code is basically equivalent to:

for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3

But the whole point of dict (and set, Counter, ...) is being able to get the desired value directly:

my_dict['c']
#=> 3

If your dict has 1000 values, the first example will on average be about 500 times slower than the second. Here is a simple description I found on Reddit:

A dict is like a magic coat check room. You hand your coat over and get a ticket. Whenever you give that ticket back, you immediately get your coat. You can have a lot of coats, but you still get your coat back immediately. There is a lot of magic going on inside the coat check room, but you don't really care as long as you get your coat back immediately.
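
The "about 500 times" estimate can be checked by counting how many keys a linear scan inspects, rather than by timing anything. This is only a sketch: the 1000 integer keys and the helper `scan_steps` are made up for illustration, and it counts comparisons, not wall-clock time:

```python
keys = list(range(1000))
my_dict = {k: k * 2 for k in keys}

def scan_steps(target):
    """Number of keys inspected by a linear scan before reaching target."""
    for steps, key in enumerate(my_dict, start=1):
        if key == target:
            return steps
    return len(my_dict)

# Average over every possible target key
average_steps = sum(scan_steps(k) for k in keys) / len(keys)
print(average_steps)  # 500.5
```

A direct lookup such as `my_dict[500]` needs a single hash computation instead, which is where the factor of roughly 500 comes from.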

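Refactoring the code

You just need to find a common signature for "Today is a good day!" and "Is today a good day?". One way is to extract the words, convert them to lowercase, sort them and join them. What matters is that the output is immutable and hashable (e.g. tuple, string, frozenset). That way it can be used directly in a set, Counter or dict, without iterating over every key.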

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1

vocab
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

Even shorter:

from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
# Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})

This code should be much faster than what you have tried so far.
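
Note that `split(" ")` keeps punctuation and trailing newlines attached to the words, so lines read from a file such as "Today is a good day!\n" would not match "Is today a good day?\n". One possible variant of the signature, assuming words consist only of ASCII letters, digits and apostrophes (the regular expression, the name `signature` and the sample `lines` are assumptions for illustration):

```python
import re
from collections import Counter

def signature(sentence):
    """Sorted lowercase words, ignoring punctuation and surrounding whitespace."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return " ".join(sorted(words))

# Hypothetical lines as they might come from a text file, newlines included
lines = ["Today is a good day!\n", "Is today a good day?\n", "c b a\n", "a b c"]

vocab = Counter(signature(line) for line in lines)
print(vocab)
# Counter({'a day good is today': 2, 'a b c': 2})
```

Text with accented or non-Latin characters would need a different pattern, for example one based on `\w`.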

Another alternative

If you want to keep the original sentences in a list, you can use setdefault:

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)

vocab

#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
# 'a b c': ['a b c', 'c b a'],
# 'a a b c': ['a a b c']}
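
The same grouping can also be written with `collections.defaultdict`, which some readers may find easier to follow than setdefault. This is a sketch along the same lines. The names `groups` and `duplicates` are illustrative, and the final filter is an extra step for listing only signatures shared by more than one sentence:

```python
from collections import defaultdict

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split()))

# Missing keys start out as an empty list automatically
groups = defaultdict(list)
for sentence in sentences:
    groups[sorted_words(sentence)].append(sentence)

# Keep only the signatures that occur in more than one sentence
duplicates = {sig: found for sig, found in groups.items() if len(found) > 1}
print(duplicates)
# {'a day good is today': ['Today is a good day', 'Is today a good day'],
#  'a b c': ['a b c', 'c b a']}
```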


Tags: python, text, text-extraction, python-2.7
