My code runs on a small sample but not on a large one

Question

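I am trying to count how often each word occurs in a variable. The variable holds more than 700,000 observations. The output should be a dictionary containing the most frequent words. I use the code below to do this: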

d1 = {}
for i in range(len(words)-1):
    x=words[i]
    c=0
    for j in range(i,len(words)):
        c=words.count(x)
    count=dict({x:c})
    if x not in d1.keys():
        d1.update(count)

I ran the code on the first 1,000 observations and it worked fine. The output looks like this:

[('semantic', 23),
 ('representations', 11),
 ('models', 10),
 ('task', 10),
 ('data', 9),
 ('parser', 9),
 ('language', 8),
 ('languages', 8),
 ('paper', 8),
 ('meaning', 8),
 ('rules', 8),
 ('results', 7),
 ('performance', 7),
 ('parsing', 7),
 ('systems', 7),
 ('neural', 6),
 ('tasks', 6),
 ('entailment', 6),
 ('generic', 6),
 ('te', 6),
 ('natural', 5),
 ('method', 5),
 ('approaches', 5)]

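When I try to run it on 100,000 observations, it just keeps running. I have been trying for more than 24 hours and it still has not finished. Does anyone have an idea?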

Answer 1

You can use collections.Counter.

from collections import Counter

counts = Counter(words)
print(counts.most_common(20))
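Since the question asks for a dictionary of the most frequent words, the list returned by most_common can be passed straight to dict. This is a minimal sketch, assuming words is a list of strings; the sample list and the cutoff of 2 are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data; in the question, `words` holds the real observations.
words = ["semantic", "parser", "semantic", "data", "parser", "semantic"]

counts = Counter(words)             # a single pass over the list
top = dict(counts.most_common(2))   # keep only the 2 most frequent words
print(top)  # {'semantic': 3, 'parser': 2}
```

On Python 3.7 and later the resulting dictionary keeps the order returned by most_common, so the most frequent word comes first.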

Answer 2

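@Jon's answer is the best one for your case, but in some situations collections.Counter can be slower than a plain loop (especially if you do not need to sort by frequency afterwards), as I asked about in another question.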

You can compute the frequencies by iterating.

d1 = {}
for item in words:
    # Test membership directly against the dictionary
    if item in d1:
        d1[item] += 1
    else:
        d1[item] = 1

# Finally, sort the dictionary of frequencies, most frequent first
print(dict(sorted(d1.items(), key=lambda item: item[1], reverse=True)))

But again, for your case, @Jon's answer is faster and more compact.
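Whether the loop or Counter is faster depends on the Python version and on the data, so it is worth measuring rather than assuming. The following is a rough sketch using timeit; the test data and the helper name count_with_loop are invented for this example, and the timings it prints will differ from machine to machine.

```python
import timeit
from collections import Counter

# Hypothetical test data: 100,000 words drawn from a vocabulary of 500.
words = [f"w{i % 500}" for i in range(100_000)]

def count_with_loop(items):
    """Count occurrences with a plain dictionary in a single pass."""
    freq = {}
    for item in items:
        freq[item] = freq.get(item, 0) + 1
    return freq

# Both approaches must agree before their timings are worth comparing.
assert count_with_loop(words) == dict(Counter(words))

loop_time = timeit.timeit(lambda: count_with_loop(words), number=5)
counter_time = timeit.timeit(lambda: Counter(words), number=5)
print(f"loop:    {loop_time:.3f} s")
print(f"Counter: {counter_time:.3f} s")
```

Either way, both versions visit each word once, so both finish in well under a second on 100,000 items.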

Answer 3

#...
for i in range(len(words)-1):
    #...
    #...
    for j in range(i,len(words)):
        c=words.count(x)
    #...
    if x not in d1.keys():
        #...

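I have tried to highlight above the problems your code runs into. In plain English, it reads like this:

"Count the number of occurrences of every word after the one I am looking at, and repeat that for every word in the whole list. Also, look through the whole dictionary I am building, again for every word in the list, while I am building it."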

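That is far more work than you need to do; you only need to look at each word in the list once. You do need one dictionary lookup per word, but in Python 2, checking against d1.keys() converts the dictionary into another list and scans the whole thing, which slows things down. (In Python 3, d1.keys() is a view and the test is fast, although `x not in d1` is still the idiomatic form; the nested loop with words.count(x) remains the main cost.) The code below does what you want, and much faster: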

words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']

word_counts = {}

# Look at each word in our list once
for word in words:
    # If we haven't seen it before, create a new count in our dictionary
    if word not in word_counts:
        word_counts[word] = 0

    # We've made sure our count exists, so just increment it by 1
    word_counts[word] += 1

# list(...) prints a plain list of pairs on Python 3 as well
print(list(word_counts.items()))

The example above gives output like the following (the order of the pairs may vary with the Python version):

[
    ('charlie', 2),
    ('baker', 1),
    ('able', 2),
    ('dog', 3),
    ('easy', 1)
]
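To see why the original loops finish for 1,000 words but not for 100,000, it helps to estimate the work involved. The sketch below is a back-of-the-envelope count, not a measurement: it assumes every words.count(x) call scans all n elements and ignores constant factors. The function names are made up for this illustration.

```python
def nested_loop_steps(n):
    """Rough number of element comparisons made by the question's loops.

    The outer loop runs n - 1 times, the inner loop runs n - i times,
    and each words.count(x) call scans all n elements of the list.
    """
    return sum((n - i) * n for i in range(n - 1))

def single_pass_steps(n):
    """One dictionary lookup and update per word."""
    return n

for n in (1_000, 100_000):
    print(n, nested_loop_steps(n), single_pass_steps(n))
```

Under these assumptions the estimate grows from about 5 × 10^8 steps for 1,000 words to about 5 × 10^14 for 100,000 words. That is roughly a million times more work for a list 100 times longer, whereas the single pass grows only in proportion to the list.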

python

large-data

word-count

find-occurrences