In Python, how can I generate a word frequency counter that aggregates to different levels?

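I've seen other Python word counters that read a CSV file and give word counts for an entire column. I want the word counts for each row, but reported at the "project" and "sub-project" level (other columns in my data), so that I can see whether one sub-project uses a particular word more often than another. I'd like the final columns to be: project, sub-project, word, count (per sub-project, not overall). Any help would be greatly appreciated!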

Input:

Columns - Project/Sub-project/Corpus

Project1/Sub 1/The red car is the best car

Project1/Sub 2/The blue one is better

The exported file should be:

Columns - Project/Sub-Project/Word/Frequency

Project1/Sub1/The/2

Project1/Sub2/The/1

This program might do what you want:

import csv
from collections import Counter

# newline='' lets the csv module handle line endings itself (Python 3)
with open('in.csv', newline='') as in_file:
    reader = csv.DictReader(in_file)

    with open('out.csv', 'w', newline='') as out_file:
        writer = csv.DictWriter(
            out_file,
            ['Project', 'Sub-Project', 'Word', 'Frequency'])
        writer.writeheader()

        for line in reader:
            # Count the lowercased, whitespace-separated words in this row
            words = Counter(map(str.lower, line['Corpus'].split()))

            # One output row per distinct word, most frequent first
            for word, freq in words.most_common():
                writer.writerow({
                    'Project': line['Project'],
                    'Sub-Project': line['Sub-project'],
                    'Word': word,
                    'Frequency': freq})
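
Note that the program above counts words separately for every input row. If several rows can share the same project and sub-project, you may prefer to sum the counts per group before writing. The following is a minimal sketch of that variant, not a definitive implementation: the helper name count_words_by_group and the in-memory sample data are hypothetical, and it assumes the same column names as above (Project, Sub-project, Corpus).

```python
import csv
import io
from collections import Counter, defaultdict


def count_words_by_group(rows):
    """Return {(project, sub_project): Counter}, summed over all rows in a group."""
    counts = defaultdict(Counter)
    for row in rows:
        key = (row['Project'], row['Sub-project'])
        # Counter.update adds to the existing counts instead of replacing them
        counts[key].update(word.lower() for word in row['Corpus'].split())
    return counts


# Hypothetical sample data; the first two rows share one sub-project.
sample = io.StringIO(
    "Project,Sub-project,Corpus\n"
    "Project1,Sub1,The red car is the best car\n"
    "Project1,Sub1,The car is fast\n"
    "Project1,Sub2,The blue one is better\n"
)
totals = count_words_by_group(csv.DictReader(sample))

# Write one row per (project, sub-project, word), most frequent words first
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['Project', 'Sub-Project', 'Word', 'Frequency'])
for (project, sub_project), words in totals.items():
    for word, freq in words.most_common():
        writer.writerow([project, sub_project, word, freq])
print(out.getvalue())
```

To use this with real files, pass a csv.DictReader over the opened input file instead of the sample, and write to a file opened with newline='' instead of the StringIO buffer.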