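Write a job that counts the frequencies of the first letters of words in a file. So if there are three words starting with "c", the answer would be "c 3"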

Question

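I have the following code, which gives me the word count, but I can't work out how to get the frequency of the first letter of every word. If the file contains three words starting with C, I want the result to be "C 3". Case does not matter, so 'a' and 'A' should be counted as the same letter.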

from mrjob.job import MRJob


class Job(MRJob):

    def mapper(self, key, value):
        # Emits every whitespace-separated word, so this counts whole words
        for word in value.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    Job.run()

Answer 1

You can adapt the default example from https://pypi.org/project/mrjob/:

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()

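That counts complete (lowercased) words. To count first letters instead, yield only the first character of each word: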

"""The changed MapReduce job: count the frequency of words
starting with the same (case insensitive) letter."""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")    

class MyWordCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word[0].lower(), 1)      # use the 1st letter, lowercased

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MyWordCount.run()

Save it as my_word_count.py and run it like this:

python my_word_count.py README.rst > counts.txt

The results are then in counts.txt.
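If you want to sanity-check the numbers or get the exact "c 3" layout from the question, something like the following may help. It is only a sketch in plain Python, not part of mrjob: `first_letter_counts` and `format_mrjob_output` are hypothetical helper names, and the second one assumes the job uses mrjob's default output protocol, which writes a JSON-encoded key and value separated by a tab (for example `"c"<TAB>3`).

```python
import json
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")  # same pattern as in the job above


def first_letter_counts(lines):
    """Count first letters case-insensitively, mirroring the mapper and reducer."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word[0].lower()] += 1
    return counts


def format_mrjob_output(lines):
    """Turn tab-separated JSON key/value lines into 'letter count' strings.

    Assumption: each line looks like mrjob's default output, i.e. a
    JSON-encoded key, a tab, and a JSON-encoded value.
    """
    formatted = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        formatted.append(f"{json.loads(key)} {json.loads(value)}")
    return formatted


if __name__ == "__main__":
    sample = ["Cats can climb", "an apple"]
    for letter, count in sorted(first_letter_counts(sample).items()):
        print(letter, count)
```

Note that the regex also matches words that start with a digit, an underscore or an apostrophe, so those characters can show up as "first letters" in both the job and this check; filter with `str.isalpha()` if you only want letters.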


python

mrjob