Hadoop Streaming can't run Python
I am trying to run a Hadoop Streaming MapReduce job with Python code, but it always fails with the same errors, either
File: file:/C:/py-hadoop/map.py is not readable
or
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
I am using Hadoop 3.1.1 and Python 3.8 on Windows 10.
This is my MapReduce command line:
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py,C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output
map.py
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
reduce.py
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
clean = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    word = filter(lambda x: x in clean, word).lower()
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print("%s\t%s" % (current_word, current_count))
I have also tried different command lines, such as
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py -mapper "python map.py" -file C:/py-hadoop/reduce.py -reducer "python reduce.py" -input /user/input -output /user/python-output
and
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file py-hadoop/map.py -mapper "python map.py" -file py-hadoop/reduce.py -reducer "python reduce.py" -input /user/input -output /user/python-output
but they still give exactly the same errors.
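A minimal local smoke test along the lines below (the sample text and the C:/py-hadoop paths are just assumptions about my setup) runs both scripts outside Hadoop and shows the underlying Python traceback directly, instead of only the generic "subprocess failed with code 1":

import subprocess

# made-up sample input; the paths assume the scripts live in C:/py-hadoop
sample = b"Hello hadoop hello streaming\n"

# run the mapper on the sample text
mapped = subprocess.run(["python", "C:/py-hadoop/map.py"],
                        input=sample, capture_output=True).stdout
# Hadoop sorts the mapper output by key before it reaches the reducer
shuffled = b"".join(sorted(mapped.splitlines(keepends=True)))
# run the reducer on the sorted mapper output
reduced = subprocess.run(["python", "C:/py-hadoop/reduce.py"],
                         input=shuffled, capture_output=True)
print(reduced.stdout.decode())
print(reduced.stderr.decode())  # any Python traceback from the reducer appears here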
Sorry if my English is bad, I am not a native speaker.
Already fixed it.
The problem was caused by reduce.py: in Python 3, filter() returns an iterator, so the old line word = filter(...).lower() raises an AttributeError, which is what makes the reducer subprocess fail with code 1. This is my new reduce.py:
import sys
import collections

counter = collections.Counter()
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    counter[word] += int(count)

for x in counter.most_common(9999):
    print(x[0], "\t", x[1])
This is the command line I use to run it:
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py -file C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output
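For reference, a quick self-contained check of the new reducer logic on made-up mapper output (the words below are invented sample data) shows the aggregation; unlike the old reducer, the Counter version does not rely on the input being sorted by key, though it keeps every distinct word in memory:

import collections
import io

# invented, unsorted "word<TAB>count" lines, standing in for mapper output
sample = io.StringIO("hadoop\t1\npython\t1\nhadoop\t1\nstreaming\t1\n")

counter = collections.Counter()
for line in sample:
    word, count = line.strip().split("\t", 1)
    counter[word] += int(count)

for word, total in counter.most_common():
    print("%s\t%s" % (word, total))
# prints: hadoop 2, python 1, streaming 1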