与 sort() 排序不一致
Inconsistent sorting with sort()
我有以下函数来计算字符串中的单词并提取顶部 "n":
函数
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
#Split words into list
wordlist = s.split()
#Count words
counts = Counter(wordlist)
#Get top n words
top_n = counts.most_common(n)
#Sort by first element, if tie by second
top_n.sort(key=lambda x: (-x[1], x[0]))
return top_n
所以它按出现次数排序,如果有联系,则按字母顺序排序。
以下示例:
print count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
有效(显示[('cat', 4), ('mat', 4), ('bat', 3)]
)
print count_words("betty bought a bit of butter but the butter was bitter", 3)
不起作用(显示 [('butter', 2), ('a', 1), ('bitter', 1)]
但应该有 betty
而不是 bitter
因为它们是并列的 be...
在bi...
)
之前
print count_words("betty bought a bit of butter but the butter was bitter", 6)
有效(按预期显示 [('butter', 2), ('a', 1), ('betty', 1), ('bitter', 1), ('but', 1), ('of', 1)]
和 bitter
之前的 betty
)
是什么原因造成的(可能是字长?)我该如何解决?
问题不在于 sort
调用,而在于 most_common
。 Counter
实现为散列 table,因此它使用的顺序是 任意 。当您要求 most_common(n)
时,它会 return n
最常用的词,如果有联系,它会任意决定 return!
解决这个问题的最简单方法是避免使用 most_common
并直接使用列表:
top_n = sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n]
您要求的是前 3 名,因此您在可以按特定排序顺序挑选项目之前就剪切了数据。
而不是 most_common()
预排序然后重新排序,使用 heapq
按您的自定义标准排序(前提是 n
小于实际桶数) :
import heapq
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
counts = Counter(s.split())
key = lambda kv: (-kv[1], kv[0])
if n >= len(counts):
return sorted(counts.items(), key=key)
return heapq.nsmallest(n, counts.items(), key=key)
在 Python 2 上,您可能希望对上述调用使用 iteritems()
而不是 items()
。
这将重新创建 Counter.most_common()
method,但使用更新的密钥。与原始版本一样,使用 heapq
确保这与 O(NlogK) 性能相关,而不是 O(NlogN)(N
桶的数量,K 是您想要查看的顶部元素数).
演示:
>>> count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
[('cat', 4), ('mat', 4), ('bat', 3)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 3)
[('butter', 2), ('a', 1), ('betty', 1)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 6)
[('butter', 2), ('a', 1), ('betty', 1), ('bit', 1), ('bitter', 1), ('bought', 1)]
以及快速性能比较(在 Python 3.6.0b1 上):
>>> from collections import Counter
>>> from heapq import nsmallest
>>> from random import choice, randrange
>>> from timeit import timeit
>>> from string import ascii_letters
>>> sentence = ' '.join([''.join([choice(ascii_letters) for _ in range(randrange(3, 15))]) for _ in range(1000)])
>>> counts = Counter(sentence) # count letters
>>> len(counts)
53
>>> key = lambda kv: (-kv[1], kv[0])
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=100000)
2.119404911005404
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=100000)
1.9657367869949667
>>> counts = Counter(sentence.split()) # count words
>>> len(counts)
1000
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=10000) # note, 10 times fewer
6.689963405995513
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=10000)
2.902360848005628
您可以通过执行 .most_common()
,然后对结果进行排序和切片来修复它,而不是将 n
提供给 most_common
:
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
#Split words into list
wordlist = s.split()
#Count words
counts = Counter(wordlist)
#Sort by frequency
top = counts.most_common()
#Sort by first element, if tie by second
top.sort(key=lambda x: (-x[1], x[0]))
return top[:n]
我有以下函数来计算字符串中的单词并提取顶部 "n":
函数
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
#Split words into list
wordlist = s.split()
#Count words
counts = Counter(wordlist)
#Get top n words
top_n = counts.most_common(n)
#Sort by first element, if tie by second
top_n.sort(key=lambda x: (-x[1], x[0]))
return top_n
所以它按出现次数排序,如果有联系,则按字母顺序排序。 以下示例:
print count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
有效(显示[('cat', 4), ('mat', 4), ('bat', 3)]
)
print count_words("betty bought a bit of butter but the butter was bitter", 3)
不起作用(显示 [('butter', 2), ('a', 1), ('bitter', 1)]
但应该有 betty
而不是 bitter
因为它们是并列的 be...
在bi...
)
print count_words("betty bought a bit of butter but the butter was bitter", 6)
有效(按预期显示 [('butter', 2), ('a', 1), ('betty', 1), ('bitter', 1), ('but', 1), ('of', 1)]
和 bitter
之前的 betty
)
是什么原因造成的(可能是字长?)我该如何解决?
问题不在于 sort
调用,而在于 most_common
。 Counter
实现为散列 table,因此它使用的顺序是 任意 。当您要求 most_common(n)
时,它会 return n
最常用的词,如果有联系,它会任意决定 return!
解决这个问题的最简单方法是避免使用 most_common
并直接使用列表:
top_n = sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:n]
您要求的是前 3 名,因此您在可以按特定排序顺序挑选项目之前就剪切了数据。
而不是 most_common()
预排序然后重新排序,使用 heapq
按您的自定义标准排序(前提是 n
小于实际桶数) :
import heapq
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
counts = Counter(s.split())
key = lambda kv: (-kv[1], kv[0])
if n >= len(counts):
return sorted(counts.items(), key=key)
return heapq.nsmallest(n, counts.items(), key=key)
在 Python 2 上,您可能希望对上述调用使用 iteritems()
而不是 items()
。
这将重新创建 Counter.most_common()
method,但使用更新的密钥。与原始版本一样,使用 heapq
确保这与 O(NlogK) 性能相关,而不是 O(NlogN)(N
桶的数量,K 是您想要查看的顶部元素数).
演示:
>>> count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
[('cat', 4), ('mat', 4), ('bat', 3)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 3)
[('butter', 2), ('a', 1), ('betty', 1)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 6)
[('butter', 2), ('a', 1), ('betty', 1), ('bit', 1), ('bitter', 1), ('bought', 1)]
以及快速性能比较(在 Python 3.6.0b1 上):
>>> from collections import Counter
>>> from heapq import nsmallest
>>> from random import choice, randrange
>>> from timeit import timeit
>>> from string import ascii_letters
>>> sentence = ' '.join([''.join([choice(ascii_letters) for _ in range(randrange(3, 15))]) for _ in range(1000)])
>>> counts = Counter(sentence) # count letters
>>> len(counts)
53
>>> key = lambda kv: (-kv[1], kv[0])
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=100000)
2.119404911005404
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=100000)
1.9657367869949667
>>> counts = Counter(sentence.split()) # count words
>>> len(counts)
1000
>>> timeit('sorted(counts.items(), key=key)[:3]', 'from __main__ import counts, key', number=10000) # note, 10 times fewer
6.689963405995513
>>> timeit('nsmallest(3, counts.items(), key=key)', 'from __main__ import counts, nsmallest, key', number=10000)
2.902360848005628
您可以通过执行 .most_common()
,然后对结果进行排序和切片来修复它,而不是将 n
提供给 most_common
:
def count_words(s, n):
"""Return the n most frequently occuring words in s."""
#Split words into list
wordlist = s.split()
#Count words
counts = Counter(wordlist)
#Sort by frequency
top = counts.most_common()
#Sort by first element, if tie by second
top.sort(key=lambda x: (-x[1], x[0]))
return top[:n]