How can I change this reducer code to find the longest words (and their length) instead of the frequency of the words?
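This code finds the frequency of words in a text file. I would like to know how to change it so that it finds the longest words in the text file and prints them out, e.g. "The longest word has 13 characters. The result includes:"
REDUCER CODE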
import sys

results = {}
for line in sys.stdin:
    # each input line has the form "word<TAB>count"
    word, frequency = line.strip().split('\t', 1)
    results[word] = results.get(word, 0) + int(frequency)

# print the words in alphabetical order with their total counts
for word in sorted(results):
    print(word, results[word])
MAPPER CODE
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # emit "word<TAB>1"; the reducer splits each line on the tab
        print(word, "1", sep="\t")
Building on my suggestion (loop over the words, keeping the longest one in a variable):
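import sys

longest = ""
for line in sys.stdin:  # read from standard input, as the mapper and reducer do
    # split() already discards surrounding whitespace, so no extra strip() is needed
    for word in line.lower().split():
        if len(word) > len(longest):
            longest = word
print("Longest word is:", longest, "with the length of:", len(longest))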
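The same single-winner loop can also be expressed with the built-in `max` and `key=len`. This is only a minimal sketch: `longest_word` and `sample` are hypothetical names introduced for illustration, and on a tie it returns the first longest word it sees, just like the loop above.

```python
def longest_word(lines):
    """Return the first longest whitespace-separated word, or "" if there is none."""
    words = (word for line in lines for word in line.lower().split())
    # key=len compares words by length; default="" covers empty input
    return max(words, key=len, default="")


# Hypothetical sample input standing in for the lines of a text file.
sample = ["The quick brown fox", "jumped over the lazy dog"]
longest = longest_word(sample)
print("Longest word is:", longest, "with the length of:", len(longest))
# prints: Longest word is: jumped with the length of: 6
```

Passing `sys.stdin` instead of `sample` should behave the same way on piped input, since iterating over a file object yields its lines.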
If you don't want to keep all the words, you can do this:
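import sys

longest = set()
max_length = 0
for line in sys.stdin:
    for word in line.strip().split():
        length = len(word)
        if length > max_length:
            # a strictly longer word replaces everything collected so far
            max_length = length
            longest = {word}
        elif length == max_length:
            # a tie joins the current set of longest words
            longest.add(word)
print(longest)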
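To use the approach above as a drop-in replacement for the original reducer, the input lines would be `word<TAB>count` pairs from the mapper rather than raw text, so only the part before the tab matters. A minimal sketch under that assumption follows; `find_longest` and `sample` are hypothetical names, and the output wording simply follows the example in the question.

```python
def find_longest(lines):
    """Return (max_length, set of longest words) from 'word<TAB>count' lines."""
    longest = set()
    max_length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Keep only the key; the count emitted by the mapper is not needed here.
        word = line.split("\t", 1)[0]
        if len(word) > max_length:
            max_length = len(word)
            longest = {word}
        elif len(word) == max_length:
            longest.add(word)
    return max_length, longest


# Hypothetical sample standing in for the mapper's tab-separated output.
sample = ["apple\t1\n", "kiwi\t1\n", "grape\t1\n"]
max_length, longest = find_longest(sample)
print("The longest word has", max_length, "characters. The result includes:")
for word in sorted(longest):
    print(word)
```

In the reducer itself, `sys.stdin` would take the place of `sample`. Sorting the set before printing keeps the output order stable between runs.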
If you want to keep them, grouped by length, you can use defaultdict:
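import sys
from collections import defaultdict

# maps each word length to the set of words with that length
words_length = defaultdict(set)
for line in sys.stdin:
    for word in line.strip().split():
        words_length[len(word)].add(word)
# max() over the dict picks the largest key, i.e. the greatest length
print(words_length[max(words_length)])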
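One possible benefit of keeping every group is that lengths other than the maximum stay available afterwards. The sketch below lists the two greatest lengths with their words; `group_by_length` and `sample` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def group_by_length(lines):
    """Map each word length to the set of words that have that length."""
    groups = defaultdict(set)
    for line in lines:
        for word in line.split():
            groups[len(word)].add(word)
    return groups


# Hypothetical sample input standing in for the lines of a text file.
sample = ["the quick brown fox", "jumps over the lazy dog"]
groups = group_by_length(sample)
# Lengths in descending order, so the longest groups come first.
for length in sorted(groups, reverse=True)[:2]:
    print(length, sorted(groups[length]))
# prints:
# 5 ['brown', 'jumps', 'quick']
# 4 ['lazy', 'over']
```

Note that `max(groups)` raises `ValueError` when no words were read, whereas slicing the sorted keys as above simply yields nothing for empty input.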