Check if two strings contain the same set of words in Python
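I am trying to compare two sentences to see whether they contain the same set of words.
For example, comparing "today is a good day" and "is today a good day" should return true.
I am currently using the Counter class from the collections module: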
from collections import Counter

vocab = {}
for line in file_ob:
    flag = 0
    for sentence in vocab:
        if Counter(sentence.split(" ")) == Counter(line.split(" ")):
            vocab[sentence] += 1
            flag = 1
            break
    if flag == 0:
        vocab[line] = 1
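It seems to work fine for a few lines, but my text file has more than 1000 lines and it never finishes executing. Is there another, more efficient way to compute the result for the entire file?
Edit:
I just need a replacement for the Counter approach, something to substitute for it. Nothing else about the implementation should change.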
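Try
set(sentence.split(" ")) == set(line.split(" "))
Comparing set objects is faster than comparing Counter objects. Both set and Counter objects are basically sets, but when you compare Counter objects, both the keys and the values have to be compared, whereas a set comparison only needs to compare the keys.
Thanks to Eric and Barmar for their input.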
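One caveat is easier to see in code: a set discards repeated words, so the set comparison and the Counter comparison do not always agree. The following is a minimal sketch; the helper names same_set and same_multiset and the sample sentences are made up for illustration.

```python
from collections import Counter

a = "today is a good day"
b = "is today a good day"
c = "today is a good day is a good day"  # same words, some repeated


def same_set(x, y):
    # Ignores both word order and how often each word occurs.
    return set(x.split()) == set(y.split())


def same_multiset(x, y):
    # Ignores word order but still compares the count of each word.
    return Counter(x.split()) == Counter(y.split())


print(same_set(a, b), same_multiset(a, b))  # True True
print(same_set(a, c), same_multiset(a, c))  # True False
```

Which of the two is right depends on whether repeated words should matter for your data.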
Your full code would look like this:
vocab = {}  # placeholder: a dictionary with around 1000 sentences as keys
for line in file_ob:
    for sentence in vocab:
        if set(sentence.split(" ")) == set(line.split(" ")):
            vocab[sentence] += 1
To account for duplicate/multiple words, your equality comparison could be:
def hash_sentence(s):
    # Join with a space so that "ab c" and "a bc" do not produce the same string.
    return hash(' '.join(sorted(s.split())))

a = 'today is a good day'
b = 'is today a good day'
c = 'today is a good day is a good day'
hash_sentence(a) == hash_sentence(b)  # True
hash_sentence(a) == hash_sentence(c)  # False
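Also, note that in your implementation every sentence is counted n times (for sentence in vocab:).
In your code, you can move the Counter construction out of the inner loop instead of recomputing both counters for every pair. This should improve the algorithm by a factor proportional to the average number of tokens per string.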
from collections import Counter

vocab = {}  # placeholder: a dictionary with around 1000 sentences as keys
# Build each sentence's Counter once, up front.
vocab_counter = {k: Counter(k.split(" ")) for k in vocab}
for line in file_obj:
    line_counter = Counter(line.split(" "))
    for sentence in vocab:
        if vocab_counter[sentence] == line_counter:
            vocab[sentence] += 1
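This could be improved further by using the counter as the dictionary key, so that matching a sentence becomes a lookup instead of a linear search. The frozendict package might be useful here, since it lets you use one dictionary as a key in another.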
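As a sketch of the same idea without a third-party package: a Counter is not hashable, but a frozenset of its (word, count) pairs is, so it can serve as the dictionary key. The helper name counter_key and the sample lines are assumptions for illustration, not part of the original code.

```python
from collections import Counter


def counter_key(sentence):
    """Hashable stand-in for Counter(sentence.split())."""
    return frozenset(Counter(sentence.split()).items())


# Hypothetical input lines standing in for the file contents.
lines = ["today is a good day", "is today a good day", "a b c", "a a b c"]

vocab = {}
for line in lines:
    key = counter_key(line)
    # One dictionary lookup per line, no inner loop over existing sentences.
    vocab[key] = vocab.get(key, 0) + 1

print(vocab[counter_key("good day a today is")])  # 2
print(vocab[counter_key("a a b c")])  # 1
```

Because the key keeps the counts, "a b c" and "a a b c" still end up in different entries.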
You really don't need two loops.
The correct way to use dicts
Suppose you have a dict:
my_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 5, 'g': 6}
Your code is basically equivalent to:
for (key, value) in my_dict.items():
    if key == 'c':
        print(value)
        break
#=> 3
But the whole point of a dict (and of set, Counter, ...) is to be able to get the desired value directly:
my_dict['c']
#=> 3
If your dict has 1000 values, the first example will be 500 times slower on average than the second one. Here is a simple description I found on Reddit:
A dict is like a magic coat check room. You hand your coat over and
get a ticket. Whenever you give that ticket back, you immediately get
your coat. You can have a lot of coats, but you still get your coat
back immediately. There is a lot of magic going on inside the coat
check room, but you don't really care as long as you get your coat
back immediately.
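To see where the figure of 500 comes from, the sketch below counts how many keys a linear scan examines before it finds the one it wants, averaged over every key of a 1000-entry dict. The key names and the helper scan_steps are made up for illustration. The actual timing ratio depends on the machine and on the cost of hashing, so treat the number as an order of magnitude rather than a measurement.

```python
keys = [f"key{i}" for i in range(1000)]
my_dict = {k: i for i, k in enumerate(keys)}


def scan_steps(d, wanted):
    """Number of keys examined by a linear scan before finding `wanted`."""
    for steps, key in enumerate(d, start=1):
        if key == wanted:
            return steps
    raise KeyError(wanted)


# Average over all keys: (1 + 2 + ... + 1000) / 1000
average = sum(scan_steps(my_dict, k) for k in keys) / len(keys)
print(average)  # 500.5

# The direct lookup does not walk through the other keys at all.
print(my_dict["key999"])  # 999
```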
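Refactored code
You just need to find a common signature between "Today is a good day!" and "Is today a good day?". One way is to extract the words, convert them to lowercase, sort them and join them. What matters is that the output is immutable (e.g. a tuple, a string or a frozenset). That way it can be used directly in sets, Counters or dicts, without having to iterate over every key.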
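The "extract the words" step matters for the two example sentences, because they end in different punctuation. One possible way to do it, shown here as a sketch rather than as the answer's own method, is a regular expression that keeps runs of word characters. The helper names words, signature and set_signature are assumptions.

```python
import re


def words(sentence):
    # Lowercase, then keep runs of letters, digits and underscores.
    return re.findall(r"\w+", sentence.lower())


def signature(sentence):
    # Sorted tuple: word order is ignored, repeated words still count.
    return tuple(sorted(words(sentence)))


def set_signature(sentence):
    # frozenset: word order and repetitions are both ignored.
    return frozenset(words(sentence))


print(signature("Today is a good day!") == signature("Is today a good day?"))  # True
print(signature("a b c") == signature("a a b c"))  # False
print(set_signature("a b c") == set_signature("a a b c"))  # True
```

Both signatures are hashable, so either one can be used as a dict key; the choice depends on whether repeated words should distinguish two sentences.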
from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]
vocab = Counter()
for sentence in sentences:
    sorted_words = ' '.join(sorted(sentence.lower().split(" ")))
    vocab[sorted_words] += 1
vocab
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})
Or even shorter:
from collections import Counter

sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = Counter(sorted_words(sentence) for sentence in sentences)
#=> Counter({'a day good is today': 2, 'a b c': 2, 'a a b c': 1})
This code should be much faster than what you have tried so far.
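The question reads its sentences from a file rather than from a list. A minimal sketch of the same approach applied line by line follows, with io.StringIO standing in for the open file and assuming one sentence per line. It uses split() with no argument, because lines read from a file end in a newline and split(" ") would leave that newline attached to the last word.

```python
import io
from collections import Counter

# Stand-in for an open text file (assumption: one sentence per line).
file_ob = io.StringIO(
    "today is a good day\n"
    "is today a good day\n"
    "a b c\n"
    "\n"
    "c b a\n"
)


def sorted_words(sentence):
    # split() with no argument drops the trailing newline and extra spaces.
    return " ".join(sorted(sentence.lower().split()))


# Skip blank lines, then count each signature in a single pass.
vocab = Counter(sorted_words(line) for line in file_ob if line.strip())
print(vocab)
```

The file is read once and each line costs one dictionary update, so the running time grows with the number of lines instead of with its square.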
Another alternative
If you want to keep the original sentences in a list, you can use setdefault:
sentences = ["Today is a good day", 'a b c', 'a a b c', 'c b a', "Is today a good day"]

def sorted_words(sentence):
    return ' '.join(sorted(sentence.lower().split(" ")))

vocab = {}
for sentence in sentences:
    vocab.setdefault(sorted_words(sentence), []).append(sentence)
vocab
#=> {'a day good is today': ['Today is a good day', 'Is today a good day'],
#    'a b c': ['a b c', 'c b a'],
#    'a a b c': ['a a b c']}