Python word counter with sorted frequency

Question

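I'm trying to read a text file and then print out all of its words, with the most frequently used words at the top and the counts decreasing further down the list. I'm using Python 3.3.2.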

def wordCounter(thing):
# Open a file
    file = open(thing, "r+")
    newWords={}
    for words in file.read().split():
        if words not in newWords:
            newWords[words] = 1
        else:
            newWords[words] += 1

    for k,v in frequency.items():
        print (k, v)
    file.close()

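Right now it does print everything in the /format/ I want, but some words that are used more often end up lower in the list than words that are used less. I tried using newWords.sort(), but it says: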

"AttributeError: 'dict' object has no attribute 'sort'"

So I'm at a loss, since my knowledge is very limited.

Answer 1

To print the most frequently used words first:

from operator import itemgetter

for k, v in sorted(frequency.items(), key=itemgetter(1), reverse=True):
    print(k, v)

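key is a function that computes the value to sort by. In our case, itemgetter(1) retrieves the second element of each (word, count) pair, i.e. the frequency, as the sort criterion.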

An alternative without the import:

for k, v in sorted(frequency.items(), key=lambda x: x[1], reverse=True):
    print(k, v)
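
Putting the counting and the sorting together, here is a minimal self-contained sketch. The function name count_words and the sample sentence are made up for illustration, and it takes a string rather than a file so it can be run as-is:

```python
from operator import itemgetter

def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    frequency = {}
    for word in text.split():
        # dict.get supplies 0 the first time a word is seen
        frequency[word] = frequency.get(word, 0) + 1
    # Sort the (word, count) pairs by the count, highest first
    return sorted(frequency.items(), key=itemgetter(1), reverse=True)

result = count_words("the cat and the hat and the bat")
for word, count in result:
    print(word, count)
```

Words with the same count come out in whatever order sorted() receives them, so add a secondary key if ties need a fixed order.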

Answer 2

You can try this approach:

from collections import Counter

with open('file_name.txt') as f:
    # Count every whitespace-separated word in the file
    c = Counter(f.read().split())
    # print() is a function in Python 3
    print(c.most_common())
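
If only the top few words are needed, most_common also accepts a count. A small sketch with a made-up sample string in place of the file:

```python
from collections import Counter

# Sample text standing in for the file contents
c = Counter("a b a c b a".split())

# most_common(n) returns only the n most frequent (word, count) pairs
top_two = c.most_common(2)
print(top_two)  # [('a', 3), ('b', 2)]
```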

Answer 3

Don't reinvent the wheel: collections.Counter does the counting, and its .most_common method does the sorting, giving you the words from most to least frequent:

from collections import Counter

def wordCounter(thing):
    with open(thing) as f:
        cn = Counter(w for line in f for w in line.split())
        return cn.most_common()

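You also don't need to read the whole file into memory; you can iterate over it line by line and split each line. You also have to account for punctuation, which you can remove with str.strip: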

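from collections import Counter
from string import punctuation

def wordCounter(thing):
    with open(thing) as f:
        # Strip leading/trailing punctuation from each word before counting
        cn = Counter(w.strip(punctuation) for line in f for w in line.split())
        return cn.most_common()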
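
Note that str.strip only removes characters from the two ends of each word, so punctuation inside a word is kept. A quick sketch with a made-up sentence:

```python
from string import punctuation

words = "Hello, world! It's (really) fine...".split()

# strip() trims punctuation at the start and end only; the apostrophe in "It's" stays
cleaned = [w.strip(punctuation) for w in words]
print(cleaned)  # ['Hello', 'world', "It's", 'really', 'fine']
```

A token made up entirely of punctuation (a lone "-", say) becomes an empty string, so you may want to skip empty results before counting.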

Answer 4

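Dictionaries have no sort() method. However, you can pass the dictionary to the built-in sorted() function, which produces a list of the dictionary's keys. As the sort key, use a function that returns the dictionary's value for each key, i.e. the get() method.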

# reverse=True puts the highest counts first, as the question asks
for key in sorted(newWords, key=newWords.get, reverse=True):
    print(key, newWords[key])

Also, it looks like you've been doing some refactoring, since frequency is not defined in your code.
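
To see what sorted() returns for a dictionary, here is a small sketch with made-up counts:

```python
counts = {"pear": 2, "apple": 1, "fig": 3}

# Without a key function, sorted() orders the dictionary's keys alphabetically
by_name = sorted(counts)
# With key=counts.get, the keys are ordered by their values, lowest first
by_count = sorted(counts, key=counts.get)
# reverse=True flips that, so the highest value comes first
by_count_desc = sorted(counts, key=counts.get, reverse=True)

print(by_name)        # ['apple', 'fig', 'pear']
print(by_count)       # ['apple', 'pear', 'fig']
print(by_count_desc)  # ['fig', 'pear', 'apple']
```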

Answer 5

If you want to sort without any imports:

word_count = sorted(new_words.items(), key=lambda x: x[1], reverse=True)

Note: a better way to pick out the words is to use a regular expression:

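import re
from collections import defaultdict

# defaultdict(int) starts every missing word at 0
word_count = defaultdict(int)
# A word is a letter followed by any number of letters or digits
pattern = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
# The with block closes the file automatically
with open("file.txt", "r") as file:
    for line in file:
        for word in pattern.findall(line):
            word_count[word] += 1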
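
To check what this pattern treats as a word, here is a quick sketch on a made-up sentence:

```python
import re

pattern = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

# Digits may follow a letter but not start a word; apostrophes split words
tokens = pattern.findall("Python3 isn't 2fast, is it?")
print(tokens)  # ['Python3', 'isn', 't', 'fast', 'is', 'it']
```

So contractions such as "isn't" are counted as two separate tokens, and a leading digit is dropped; adjust the pattern if that matters for your text.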
